Abstract. Programs that process data residing in files are widely
used in varied domains, such as banking, healthcare, and web-traffic
analysis. Precise static analysis of these programs in the context of software
verification and transformation tasks is a challenging problem. Our key
insight is that static analysis of file-processing programs can be made
more useful if knowledge of the input file formats of these programs is
made available to the analysis. We propose a generic framework that
is able to perform any given underlying abstract interpretation on the
program, while restricting the attention of the analysis to program paths
that are potentially feasible when the program’s input conforms to the
given file format specification. We describe an implementation of our
approach, and present empirical results using real and realistic programs
that show how our approach enables novel verification and transformation
tasks, and also improves the precision of standard analysis problems.


1 Introduction 

Processing data that resides in files or documents is a central aspect of
computing in many organizations and enterprises. Standard file formats or document
formats have been developed or evolved in various domains to facilitate storage
and interchange of data, e.g., in banking, health care [H], enterprise
resource planning (ERP) [33], billing [13], and web-traffic analysis [5]. The wide
adoption of such standard formats has led to extensive development of software
that reads, processes, and writes data in these formats. However, there is a lack
of tool support for developers working in these domains that specifically targets
the idioms commonly present in file-processing programs. We address this issue
by proposing a generic approach for static analysis of file-processing programs
that takes a program as well as a specification of the input file format of the
program as input, and analyzes the program in the context of behaviors of the
program that are compatible with the file-format specification.

1.1 Motivating example 

Our work has been motivated in particular by batch programs in the context
of enterprise legacy systems. Such programs are typically executed periodically,
and in each run process an input file that contains “transaction” records that
have accumulated since the last run. In order to motivate the challenges in
analyzing file-processing programs, we introduce as a running example a small
batch program, as well as a sample file that it is meant to process, in Figure 1.


     DATA DIVISION.
       INPUT FILE in-file BUFFER in-rec.
       OUTPUT FILE out-file BUFFER out-rec.
       char same-flag,
       digit eof-flag = 0.
     PROCEDURE DIVISION.
/1/  OPEN in-file, out-file
/2/  READ in-file INTO in-rec, AT END MOVE 1 TO eof-flag
/3/  WHILE eof-flag = 0
/4/    IF in-rec.typ = 'ITM' //Item record processing
/5/      MOVE in-rec.rcv, in-rec.amt TO out-rec.rcv, out-rec.amt
/6/      IF same-flag = 'S'
/7/        Itm record processing for SAME batch header
/8/      ELSE
/9/        Itm record processing for DIFF batch header
/10/     END-IF
/11/     WRITE out-file FROM out-rec
/12/   ELSE IF in-rec.typ = 'HDR' //Header Record Processing
/13/     MOVE in-rec.pyr TO out-rec.pyr
/14/     IF in-rec.src = 'SAME'
/15/       MOVE 'S' TO same-flag
/16/     ELSE
/17/       MOVE 'D' TO same-flag
/18/     END-IF
/19/     Rest of header record processing
/20/   ELSE IF in-rec.typ = 'TRL' //Trailer Record Processing
/21/     Trl record processing
/22/   ELSE
/23/     Terminate program with error
/24/   END-IF
/25/   READ in-file INTO in-rec, AT END MOVE 1 TO eof-flag
/26/ END-WHILE.
/27/ CLOSE in-file, out-file.
/28/ GOBACK.

(a)


| HDR | 10205 | 9000 | SAME |
| ITM | 10201 | 3000 |
| ITM | 10103 | 4000 |
| ITM | 18888 | 2000 |
| TRL |
| HDR | 20221 | 6000 | DIFF |
| ITM | 19999 |      |
| ITM | 10234 | 4000 |
| TRL |

(b)


  typ   pyr     tot    src
| HDR | 10205 | 9000 | SAME |

  typ   rcv     amt
| ITM | 10201 | 3000 |

typ: main type.
pyr: payer account number.
tot: total batch amount.
src: source bank.
rcv: receiver account num.
amt: item amount.

(c)


Fig. 1. (a) Example program, (b) Sample input file, (c) Input file record layouts. 


Input file format. Although our example is a toy one, the sample file shown
(in Figure 1(b)) adheres to a simplified version of a real banking format [15].
Each record is shown as a row, with fields being demarcated by vertical lines.
In this file format, the records are grouped logically into “batches”, with each
batch representing a group of “payments” from one customer to other customers.
Each batch consists of a “header” record (value ‘HDR’ in the first field), which
contains information about the paying customer, followed by one or more “item”
or “payment” records (‘ITM’ in the first field), which identify the recipients,
followed by a “trailer” record (‘TRL’). Figure 1(c) gives the names of the fields
of header as well as item records. Other than the first field typ, which we have
discussed above, another field of particular relevance to our discussions is the
src field in header records, which identifies whether the paying customer is a
customer of the bank that’s running the program (‘SAME’), or of a different
bank (‘DIFF’). The meanings of the other fields are explained in part (c) of the
figure.

The code. Figure 1(a) shows our example program, which is in a Cobol-like
syntax. The “DATA DIVISION” contains the declarations of the variables used
in the program, including the input file buffer in-rec and output file buffer
out-rec. in-rec is basically an overlay (or union, following the terminology of
the C language), of the two record layouts shown in Figure 1(c). After any record
is read into this buffer the program interprets its contents using the appropriate
layout based on the value of the typ field. The output buffer out-rec is assumed
to have fields pyr, rcv, amt, as well as other fields that are not relevant to our
discussions. These field declarations have been elided in the figure for brevity.

The statements of the program appear within the “PROCEDURE DIVISION”. 
The program has a main loop, in lines 3-26. A record is read from the input 
file first outside the main loop (line 2), and then once at the end of each
iteration of the loop (line 25). In each iteration the most recent record that was
read is processed according to whether it is a header record (lines 12-19), item 
record (lines 4-11), or trailer record (lines 20-21). The sole WRITE statement in 
the program is in the item-record processing block (line 11), and writes out a 
“processed” payment record using information in the current item record as well 
as in the previously seen header record. Lines 7 and 9 represent code (details 
elided) that populates certain fields of out-rec in distinct ways depending on 
whether the paying customer is from the same bank or a different bank. 


1.2 Analysis issues and challenges 

File-processing programs typically employ certain idioms that distinguish them 
from programs in other domains. These programs read an unbounded number 
of input records, rather than have a fixed input size. Furthermore, typically, a 
program is designed not to process arbitrary inputs, but only input files that 
adhere to a known (domain related) file-format. State variables are used in the 
program to keep track of the types of the records read until the current point 
of execution. These state variables are used to decide how to process any new 
record that is read from the file. For instance, in the program in Figure 1(a),
the variable same-flag is set in lines 14-17 to ‘S’ (for “same”) or to ‘D’ (for
“different”), based on the src field of the header record that has just been read.
This variable is then used in line 6 to decide how to process item records in 
the same batch that are subsequently read. In certain cases, the state variables 
could also be used to identify unexpected or ill-formed sequences, and to “reject” 
them. 

Analyzing, understanding, and transforming file-processing programs in
precise ways requires a unique form of path-sensitive analysis, in which, at each
program point, distinct information about the program’s state needs to be tracked
corresponding to each distinct pattern of record types that could have been read
so far before control reaches the point. We illustrate this using example
questions about the program in Figure 1, answers to which would enable various
verification and transformation activities.

Does the program silently “accept” ill-formed inputs? This is a natural and
important verification problem in the context of file-processing programs. In our
running example, if an (ill-formed) input file happens to contain an item record
as the first record (without a preceding header record), the variable same-flag
would be uninitialized after this item record is read and when control reaches
line 6. This is because this variable is initialized only when a header record
is seen, in lines 14-17. Therefore, the condition in line 6 could evaluate
non-deterministically. Furthermore, the output buffer out-rec, which will be written
out in line 11, could contain garbage in its pyr field, because this field also is
initialized only when a header record is seen (in line 13).
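
The over-acceptance scenario described above can be reproduced concretely. The following is a minimal Python sketch, not the tooling described in this paper: it mimics the control flow of the program in Figure 1(a) under stated assumptions. Records are modeled as dictionaries, the elided processing on lines 7, 9, 19, and 21 is left out, and the names `run_batch` and `UNINIT` are hypothetical.

```python
from typing import Dict, List, Tuple

UNINIT = "<uninitialized>"  # hypothetical marker for a never-assigned value


def run_batch(records: List[Dict[str, str]]) -> Tuple[List[Dict[str, str]], List[str]]:
    """Mimic the main loop of the example program; return (written records, warnings)."""
    same_flag = UNINIT
    out_rec = {"pyr": UNINIT, "rcv": UNINIT, "amt": UNINIT}
    written: List[Dict[str, str]] = []
    warnings: List[str] = []
    for in_rec in records:                      # READs on lines 2 and 25
        typ = in_rec.get("typ")
        if typ == "ITM":                        # line 4
            out_rec["rcv"] = in_rec["rcv"]      # line 5
            out_rec["amt"] = in_rec["amt"]
            if same_flag == UNINIT:             # line 6 tests same-flag
                warnings.append("line 6 reads uninitialized same-flag")
            if out_rec["pyr"] == UNINIT:
                warnings.append("line 11 writes uninitialized out-rec.pyr")
            written.append(dict(out_rec))       # line 11
        elif typ == "HDR":                      # line 12
            out_rec["pyr"] = in_rec["pyr"]      # line 13
            same_flag = "S" if in_rec["src"] == "SAME" else "D"  # lines 14-18
        elif typ == "TRL":                      # line 20
            pass                                # trailer processing elided
        else:                                   # lines 22-23
            raise ValueError("unexpected record type")
    return written, warnings


# An ill-formed file: an item record with no preceding header record.
ill_formed = [{"typ": "ITM", "rcv": "10201", "amt": "3000"}]
# A well-formed file: header, item, trailer.
well_formed = [
    {"typ": "HDR", "pyr": "10205", "tot": "9000", "src": "SAME"},
    {"typ": "ITM", "rcv": "10201", "amt": "3000"},
    {"typ": "TRL"},
]
print(run_batch(ill_formed)[1])
print(run_batch(well_formed)[1])
```

The sketch only observes one concrete run at a time; the static analysis discussed in this paper aims to establish the same facts for all files of a given format without executing the program.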

In other words, file-processing programs could silently write out garbage
values into output files or databases when given ill-formed inputs, which is
undesirable. Ideally, in the running example, the programmer ought to have employed
an additional state variable (e.g., hdr-seen), to keep track of whether a header
record is seen before every item record, and ought to have emitted a warning or
aborted the program upon identifying any violation of this requirement. More
generally, state tracking in file-processing programs is complex, and prone to being
done erroneously. Therefore, there is a need for an automated analysis that can
check whether a program “over accepts” bad files (i.e., files that don’t adhere to
a user-provided specification of well-formed files).

What program behaviors are possible with well-formed inputs? In other
situations, we are interested only in information on program states that can arise
after (prefixes of) well-formed files have been read. For instance, a developer
might be interested in knowing about possible uses of uninitialized variables
during runs on well-formed files only, without the clutter caused by warning reports
pertaining to runs on ill-formed files. Intuitively, only the first category of
warnings (those that arise on well-formed files) signifies genuine errors in the
program. This is because in many cases developers do not try to ensure meaningful
outputs for corrupted input files. In our example program, there are in fact no
instances of uninitialized variables being used during runs on well-formed files.

On a related note, one might want to know if a program can falsely issue 
an “ill-formed input” warning even when run on a well-formed file. This sort of
“under acceptance” problem could happen either due to a programming error, 
or due to misunderstanding on the developer’s part as to what inputs are to be 
expected. This could be checked by asking whether statements in the program 
that issue these warnings, such as line 23 in the example program, are reachable 
during runs on well-formed inputs. In the example program it turns out that 
this cannot happen. 

What program behaviors are possible under restricted scenarios of interest? In
some situations there is a need to identify paths in a program that are taken
during runs on certain narrower sub-classes of well-formed files. For instance, in
the running example, we might be interested only in the parts of the program
that are required when input files contain batches whose header records always
have ‘SAME’ in the src field; these parts constitute all the lines in the program
except lines 6, 9, and 14-17 (variable same-flag will no longer need to be set
or used because all input batches are guaranteed to be ‘SAME’ batches). This
is essentially a classical program specialization problem, but with a
file-format-based specialization criterion (rather than a standard criterion on the
parameters to the program). Program specialization has various applications [7], for
example, in program comprehension, decomposition of monolithic programs to collections
of smaller programs that have internally cohesive functionality, and reducing
run-time overhead.

1.3 Our approach and contributions 

Approach: Static analysis based on file formats. The primary contribution of
this paper is a generic approach to perform any given “underlying” abstract
interpretation of interest U, based on an abstract lattice L, in a path-sensitive
manner, by maintaining at each point a distinct abstract fact (i.e., element of L)
for distinct patterns of record types that could have been read so far. Typically,
to “lift” the analysis U to a path-sensitive domain, a finite set of predicates P
would be required. The path-sensitive analysis domain would then essentially
be $P \to L$. For our example program, a set of six predicates, each one formed
by conjoining one of the three predicates “in-rec.typ = ‘HDR’”, “in-rec.typ
= ‘ITM’” or “in-rec.typ = ‘TRL’”, with one of the two predicates “same-flag
= ‘S’” or “same-flag = ‘D’”, would be natural candidates. However, coming
up with this set of predicates manually would be tedious, because it requires
detailed knowledge of the state variables in the program and their usage.
Automated predicate refinement might be able to generate these predicates, but
is a complex iterative process, and might potentially generate many additional
predicates, which would increase the running time of the analysis.



Record Type    Constraint
SHdr           typ = ‘HDR’ ∧ src = ‘SAME’
DHdr           typ = ‘HDR’ ∧ src = ‘DIFF’
Itm            typ = ‘ITM’
Trl            typ = ‘TRL’


Fig. 2. (a) Well-formed input automaton, (b) Input record types. 


File-format specifications, which are usually readily available because they
are organization-wide or even industry-wide standards, have been used by
previous programming languages researchers in the context of tasks such as parser
and validator generation [18], and white-box testing [19]. Our key insight is that
if a file-format specification can be represented as a finite-state input automaton
whose transitions are labeled with record types, then the set of states Q of this
automaton (which we call file states) can be directly used to lift the analysis U,
by using the domain $Q \to L$. The intuition is that if an abstract fact $l \in L$ is
mapped to a file state $q \in Q$ at a program point, then $l$ over-approximates all
possible concrete states that can arise at that point during runs that consume a
sequence of records such that the concatenation of the types of these records is
“accepted” by the file state q of the automaton.

Figure 2(a) shows the well-formed input automaton for the file-format used
in our running example, with Figure 2(b) showing the associated input record
type descriptions (as dependent types [34]). The sample input file in Figure 1(b)
is a well-formed file as per this automaton. This is because the sequence that
consists of the types of the records in this file, namely ‘SHdr Itm Itm Itm Trl
DHdr Itm Itm Trl’, is accepted by the automaton.
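
To make the acceptance check concrete, here is a small Python sketch. The transition table is an assumption reconstructed from the prose (each batch is a header, one or more items, and a trailer), not a transcription of the automaton in the figure; the state names `qs`, `qsh`, `qdh`, `qi`, `qt`, `qe` and the helper names `step` and `accepts` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumed transitions: (source file state, label) -> set of target file states.
DELTA: Dict[Tuple[str, str], Set[str]] = {
    ("qs", "SHdr"): {"qsh"},
    ("qs", "DHdr"): {"qdh"},
    ("qsh", "Itm"): {"qi"},
    ("qdh", "Itm"): {"qi"},
    ("qi", "Itm"): {"qi"},
    ("qi", "Trl"): {"qt"},
    ("qt", "SHdr"): {"qsh"},
    ("qt", "DHdr"): {"qdh"},
    ("qt", "eof"): {"qe"},
}
START = "qs"
FINALS = {"qe"}


def step(states: Set[str], label: str) -> Set[str]:
    """Follow every transition with the given label out of any current state."""
    targets: Set[str] = set()
    for q in states:
        targets |= DELTA.get((q, label), set())
    return targets


def accepts(types: List[str]) -> bool:
    """Check whether a sequence of record types, followed by eof, is accepted."""
    current = {START}
    for label in types + ["eof"]:
        current = step(current, label)
    return bool(current & FINALS)


sample = "SHdr Itm Itm Itm Trl DHdr Itm Itm Trl".split()
print(accepts(sample))           # True: the type sequence of the sample file
print(accepts(["Itm", "Trl"]))   # False: an item record precedes any header
```

Tracking a set of current states keeps the sketch valid even if the automaton is non-deterministic.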

Intuitively, statements other than READ statements do not affect the file state
that a program is “in” during execution. Therefore, the “lifted” transfer functions
for these are straightforward, and use the underlying transfer functions from
the L analysis. The transfer function for READ plays the key role of enforcing
the ordering among record types in well-formed files. For instance, consider the
file state $q_{sh}$ in Figure 2(a), which represents the situation wherein a ‘SAME’-type
header record has just been read. A trace can enter $q_{sh}$ only from one of its
predecessor file states. Therefore, in the output of the READ transfer
function, $q_{sh}$ is mapped to the join of the abstract facts that the predecessor
file states of $q_{sh}$, namely, $q_s$ and $q_t$, were mapped to in the input to the transfer
function.


Applications. In addition to our basic approach above, we propose two
applications of it that address two natural problems in the analysis of file-processing
programs, that to our knowledge have not been explored previously in the
literature. The first application is a sound approach to check if a program potentially
“over accepts” ill-formed files, or “under accepts” well-formed files. The second
is a sound technique to specialize a program wrt a given specialization criterion
that represents a restriction of the full file-format, and that is itself represented
as an input automaton.


Program File State Graph (PFSG). We propose a novel program representation,
the PFSG, which is a graph derived from both the control-flow graph (CFG) of
the program and the given input automaton for the program. The PFSG is
basically an “exploded” version of the CFG of the original program; the
control-flow paths in the PFSG are a subset of the control-flow paths in the CFG,
with certain paths that are infeasible under the given input automaton being
omitted. Being itself a CFG, any existing static analysis can be applied on the
PFSG without any modifications, with the benefit that the infeasible paths end
up being ignored by the analysis.


We describe how to modify our basic approach to emit a PFSG, and also 
discuss formally how the results from any analysis differ when performed on the 
PFSG when compared to being performed on the original CFG. 


Implementation and empirical results. We have implemented our approach,
and applied it on several realistic as well as real Cobol batch programs. Our
approach found file-format related conformance issues in certain real programs,
and was also able to verify the absence of such errors in other programs.
In the program specialization context, we observed that our approach was
surprisingly precise in being able to identify statements and conditionals that
need not be retained in the specialized program. We found that our analysis,
when used to identify references to possibly uninitialized variables and to compute
reaching definitions, gave improved precision over the standard analysis in many cases.


The rest of this paper is structured as follows. In Section 2 we introduce key
assumptions and definitions. In Section 3 we present our approach, as well as
the two applications mentioned above. Section 4 introduces the PFSG. Section 5
presents our implementation and results. Section 6 discusses related work, while
Section 7 concludes the paper.


2 Assumptions and definitions 

Definitions (Records, Record Types, and Files) A record is a contiguous sequence
of bytes in a file. A field is a labeled non-empty sub-string of a record. Any record
has zero or more fields (if it has zero fields then the record is taken to be a
leaf-level record).

A record type $R_i$ is intuitively a specification of the length of a record, the
names of its fields and their lengths, and a constraint on the contents of the
record. For example, consider the record types shown in Figure 2(b). Each row
shows the name of a record type, and then the associated constraint.

We say that a record r is of type $R_i$ iff r satisfies the length as well as value
constraints of type $R_i$. For instance, the first record in the file in Figure 1(b)
is of type SHdr (see Figure 2(b)). Note that in general a record r could be of
multiple types.


Definitions (Files and read operations) A file is a sequence of records, of possibly
different lengths. Successive records in a file are assumed to be demarcated
explicitly, either by inter-record markers or by other meta-data that captures the
length of each record. At run time there is a file pointer associated with each
open file; a READ statement, upon execution, retrieves the record pointed to by
this pointer, copies it into the file buffer in the program associated with this file,
and advances the file pointer.


Definition (Input automaton) An input automaton S is a tuple $(Q, \Sigma, \Delta, q_s, Q_e)$,
where Q is a finite set of states, which we refer to as file states, $\Sigma = T \cup \{eof\}$,
where T is a set of record types, $\Delta$ is a set of transitions between the file states,
with each transition labeled with an element of $\Sigma$, $q_s$ is the designated start state
of S, and $Q_e$ is the (non-empty) set of designated final states of S. A transition
is labeled with eof iff the transition is to a final state. There are no outgoing
transitions from final states.

Note that an input automaton may be non-deterministic, in two different
senses. Multiple transitions out of a file state could have the same label. Also, it
is possible for a record r to be of two distinct types $t_1$ and $t_2$ and for these two
types to be the labels of two outgoing transitions from a file state.

Let q be any non-final state of Q. We define $L_T(q)$, the type language of q,
as the set of sequences of types (i.e., elements of $T^*$) that take the automaton
from its start state to q. For a final state $q_e$, $L_T(q_e)$ is defined to be the union
of the type languages of the states from which there are transitions to $q_e$.

We define $L_R(q)$, the record language of any file state q, as follows: $L_R(q)$
consists of sequences of records R such that there exists a sequence of types T
in $L_T(q)$ such that (a) the sequences R and T are of equal length, and (b) for
each $1 \le j \le |R|$, record $R[j]$ is of type $T[j]$. Recall that a file is nothing but a
sequence of records. We say that a file f conforms to an input automaton S, or
that S accepts f, if f is in $L_R(q_e)$ for some final state $q_e$ of S.
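
The definitions of record types, record languages, and conformance can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' implementation: records are dictionaries, length constraints are ignored, the transition list is reconstructed from the prose of the running example, and all names (`RECORD_TYPES`, `types_of`, `states_after`, `conforms`) are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Record = Dict[str, str]

# Value constraints of the record types of the running example.
RECORD_TYPES: Dict[str, Callable[[Record], bool]] = {
    "SHdr": lambda r: r.get("typ") == "HDR" and r.get("src") == "SAME",
    "DHdr": lambda r: r.get("typ") == "HDR" and r.get("src") == "DIFF",
    "Itm": lambda r: r.get("typ") == "ITM",
    "Trl": lambda r: r.get("typ") == "TRL",
}

# Assumed transitions, as (source, label, target) triples.
DELTA: List[Tuple[str, str, str]] = [
    ("qs", "SHdr", "qsh"), ("qs", "DHdr", "qdh"),
    ("qsh", "Itm", "qi"), ("qdh", "Itm", "qi"),
    ("qi", "Itm", "qi"), ("qi", "Trl", "qt"),
    ("qt", "SHdr", "qsh"), ("qt", "DHdr", "qdh"),
    ("qt", "eof", "qe"),
]
START = "qs"
FINALS = {"qe"}


def types_of(record: Record) -> Set[str]:
    """All record types that the record satisfies (possibly more than one)."""
    return {name for name, holds in RECORD_TYPES.items() if holds(record)}


def states_after(records: List[Record]) -> Set[str]:
    """Non-final file states q whose record language contains the sequence."""
    states = {START}
    for record in records:
        labels = types_of(record)
        states = {dst for src, lab, dst in DELTA if src in states and lab in labels}
    return states


def conforms(file: List[Record]) -> bool:
    """A file conforms if some reachable state has an eof transition to a final state."""
    reachable = states_after(file)
    return any(src in reachable and lab == "eof" and dst in FINALS
               for src, lab, dst in DELTA)


sample_file = [
    {"typ": "HDR", "src": "SAME"}, {"typ": "ITM"}, {"typ": "ITM"}, {"typ": "TRL"},
    {"typ": "HDR", "src": "DIFF"}, {"typ": "ITM"}, {"typ": "TRL"},
]
print(conforms(sample_file), states_after(sample_file[:2]))
```

Because `types_of` returns a set and `states_after` tracks a set of states, both kinds of non-determinism mentioned above are accommodated.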

Let R be some sequence of records (possibly empty). Say R is in $L_R(q)$ for
some file state q of an input automaton. If there exists an execution trace t of
the given program P that starts at the program’s entry, consumes the records in
R via the READ statements that it passes through, and reaches a program point
p, then we say that (a) trace t is due to the prefix R, and (b) trace t reaches
point p while being in file state q of the input automaton. We define $ns(t)$ as
the sequence of nodes of the control-flow graph (CFG) of the given program that
are visited by the trace t; the sequence always begins with the entry node of the
CFG and contains at least two nodes (the trace ends at the point before the last
node in the sequence).

A well-formed input automaton (which we often abbreviate to “well-formed
automaton”) is an input automaton that accepts all files that are expected to
be given as input to a program. A “specialization” automaton is an input
automaton that accepts a subset of the files accepted by a well-formed automaton,
while a “full” automaton is an input automaton that accepts every possible file.

If a program accesses multiple sequential input files this situation could be 
handled using two alternative approaches: (1) By concurrently using multiple
input automatons in the analysis, one per input file, or (2) By modeling one of 
the input files as the primary input file (with an associated automaton) and by 
modeling reads from the remaining files as always returning an undefined record. 
We adopt the latter of these approaches in our experimental evaluation. 



3 Our approach 


In this section we describe our primary contribution, which is a generic approach 
for “lifting” a given abstract interpretation wrt a file-format specification. We 
then discuss its soundness and precision. Following this we present the details of 
the two applications of our generic approach that were mentioned in Section 1.3.
Finally, we present an extension to our approach, which enables the specification 
of data integrity constraints on the contents of input files in relation to the 
contents of persistent tables. 

3.1 Abstract interpretation lifted using input automatons 

The inputs to our approach are a program P, an input automaton
$S = (Q, \Sigma, \Delta, q_s, Q_e)$, and an arbitrary “underlying” abstract interpretation
$U = (L, F_L)$, where L is a join semi-lattice and $F_L$ is a set of transfer functions
with signature $L \to L$ associated with statements and conditionals. Our
objective, as described in the introduction, is to use the provided input automaton
to compute a least fix-point solution considering only paths in the program that
are potentially feasible wrt the given input automaton.

The lattice that we use in our lifted analysis is $D = Q \to L$. The partial
ordering for this lattice is a “point-wise” ordering based on the underlying lattice
L:

$$d_1 \sqsubseteq_D d_2 \;=_{def}\; \forall q \in Q.\ d_1(q) \sqsubseteq_L d_2(q)$$

The initial value that we supply at the entry of the program is $(q_s, i_L)$, where
$i_L \in L$ is an input to our approach, and is the initial value to be used in the
context of the underlying analysis.
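
As a rough illustration of the lifted lattice, the sketch below instantiates L with a simple constant-propagation domain. Everything here is an assumption made for illustration: facts are dictionaries of known constants, `None` plays the role of bottom, the file-state names are hypothetical, and the initial value is read as the element of D that maps the start state to the given underlying value and every other file state to bottom.

```python
from typing import Dict, Optional

# Underlying facts: a dict of known constants; None stands for bottom (unreachable).
Fact = Optional[Dict[str, str]]
Lifted = Dict[str, Fact]            # an element of D, i.e., a map from Q to L

FILE_STATES = ["qs", "qsh", "qdh", "qi", "qt", "qe"]   # assumed state names


def leq_L(a: Fact, b: Fact) -> bool:
    """True if a is at least as precise as b in the constant-propagation ordering."""
    if a is None:
        return True
    if b is None:
        return False
    # b may only claim constants that a also claims, with the same value.
    return all(a.get(var) == const for var, const in b.items())


def leq_D(d1: Lifted, d2: Lifted) -> bool:
    """Point-wise ordering: d1(q) is below d2(q) for every file state q."""
    return all(leq_L(d1.get(q), d2.get(q)) for q in FILE_STATES)


def initial_value(i_L: Fact) -> Lifted:
    """Assumed reading of the initial value: start state maps to i_L, the rest to bottom."""
    d: Lifted = {q: None for q in FILE_STATES}
    d["qs"] = i_L
    return d


d0 = initial_value({})                       # nothing is known to be constant at entry
d1 = dict(d0, qi={"in-rec.typ": "ITM"})      # additionally reachable in file state qi
print(leq_D(d0, d1), leq_D(d1, d0))
```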

We now discuss our transfer functions on the lattice D. We consider the
following three categories of CFG nodes: Statements other than READ statements,
conditionals, and READ statements. Let n be any node that is neither a READ
statement nor a conditional. Let $f_L^n \in F_L$ be the “underlying” transfer
function for node n. Since the file state that any trace is in at the point before
node n cannot change after the trace executes node n, our transfer function for
node n is:

$$f_D^n(d \in D) = \lambda q \in Q.\ f_L^n(d(q))$$

Let c be a conditional node, with a true successor and a false successor. Let
$f_L^{c_t} \in F_L$ and $f_L^{c_f} \in F_L$ be the underlying true-branch and false-branch transfer
functions of c. Since a conditional node cannot modify the file state that a trace
is in, either, our transfer function for conditionals is:

$$f_D^{c_b}(d \in D) = \lambda q \in Q.\ f_L^{c_b}(d(q))$$

where ‘b’ stands for t or f.
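
The two lifted transfer functions above amount to applying the underlying function separately under each file state. A minimal Python sketch follows, assuming a constant-propagation underlying analysis in which a fact is a dictionary of known constants and `None` stands for bottom; the helper names are hypothetical, and the sample facts are illustrative rather than taken from the fix-point solution of the running example.

```python
from typing import Callable, Dict, Optional

Fact = Optional[Dict[str, str]]     # None stands for bottom (unreachable)
Lifted = Dict[str, Fact]            # maps file states to underlying facts


def lift(f_L: Callable[[Fact], Fact]) -> Callable[[Lifted], Lifted]:
    """Apply the underlying transfer function independently under each file state."""
    return lambda d: {q: f_L(fact) for q, fact in d.items()}


def move_s_to_same_flag(fact: Fact) -> Fact:
    """Underlying CP transfer function for a statement such as MOVE 'S' TO same-flag."""
    return None if fact is None else {**fact, "same-flag": "S"}


def assume_typ_is_itm(fact: Fact) -> Fact:
    """Underlying CP true-branch function for the test in-rec.typ = 'ITM'."""
    if fact is None:
        return None
    known = fact.get("in-rec.typ")
    if known is not None and known != "ITM":
        return None                 # the branch is infeasible under this fact
    return {**fact, "in-rec.typ": "ITM"}


before_test: Lifted = {
    "qsh": {"in-rec.typ": "HDR"},
    "qi": {"in-rec.typ": "ITM"},
    "qt": {"in-rec.typ": "TRL"},
}
true_branch = lift(assume_typ_is_itm)(before_test)
print(true_branch)
```

In this example only the fact under `qi` survives the true branch, which is the kind of per-file-state filtering that gives the lifted analysis its path sensitivity.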

Fig. 3. Illustration of transfer function for READ statements.


Finally, we consider the case where a node r is a READ node. This is the most
interesting case, because executing a READ statement can change the file state
that a trace is in. Firstly, a note on terminology: a dataflow value $l \in L$ is said to
represent a concrete state s if s is an element of the concretization of $l$ (which
is written as $\gamma(l)$). Secondly, we make the following assumption on the underlying
transfer function $f_L^r \in F_L$ for READ statements: Rather than simply have the
signature $L \to L$, the function $f_L^r$ ought to have the signature $(L \times \Sigma) \to L$. If t
is some record type (i.e., element of T), then, intuitively, $f_L^r(l_1, t)$ should return
a dataflow fact $l_2$ that represents the set of concrete states that can result after
the execution of the READ, assuming:

- the concrete state just before the execution of the READ is some state that is
  represented by $l_1$, and

- the READ statement retrieves a record of type t from the input file and places
  it in the input buffer.

Correspondingly, $f_L^r(l_1, eof)$ should return a dataflow fact $l_2$ that represents
the set of concrete states that can result after the execution of the READ, assuming:

- the concrete state just before the execution of the READ is some state that is
  represented by $l_1$, and

- the input buffer in the program gets populated with an undefined value, and

- the statement within the ‘AT END’ clause, if any, executes after the read
  operation.

As an illustration, say the underlying analysis U is the CP (Constant
Propagation) analysis. $f_L^r(l_1, t)$ would return a fact $l_2$ that is obtained by performing
the following transformations on $l_1$: (1) remove all existing CP facts associated
with the input buffer, and (2) obtain suitable new CP facts for the input buffer
using the constraints associated with the type t. On the other hand, $f_L^r(l_1, eof)$
would perform only Step (1) above. □
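
The two-step CP transfer function for READ can be sketched as below. This is a simplified illustration under assumptions: a CP fact is a dictionary of known constants with `None` as bottom, the type constraints are those of the record types of the running example, the effect of any AT END clause is not modeled, and the names `TYPE_CONSTRAINTS`, `BUFFER_PREFIX`, and `cp_read` are hypothetical.

```python
from typing import Dict, Optional

Fact = Optional[Dict[str, str]]     # dict of known constants; None is bottom

# Constraints of the record types, written as constant facts on the fields
# of the input buffer in-rec.
TYPE_CONSTRAINTS: Dict[str, Dict[str, str]] = {
    "SHdr": {"in-rec.typ": "HDR", "in-rec.src": "SAME"},
    "DHdr": {"in-rec.typ": "HDR", "in-rec.src": "DIFF"},
    "Itm": {"in-rec.typ": "ITM"},
    "Trl": {"in-rec.typ": "TRL"},
}
BUFFER_PREFIX = "in-rec."


def cp_read(l1: Fact, label: str) -> Fact:
    """Underlying CP transfer function for READ, given a record type or 'eof'."""
    if l1 is None:
        return None                 # unreachable stays unreachable
    # Step (1): drop every CP fact about the input buffer.
    l2 = {var: c for var, c in l1.items() if not var.startswith(BUFFER_PREFIX)}
    # Step (2): for a record type, add the constants implied by its constraint.
    if label != "eof":
        l2.update(TYPE_CONSTRAINTS[label])
    return l2


before = {"in-rec.typ": "HDR", "in-rec.src": "SAME", "same-flag": "S"}
print(cp_read(before, "Itm"))
print(cp_read(before, "eof"))
```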

Fig. 4. Fix-point solution for the program in Figure 1(a), using CP × Possibly-uninitialized-
variables analysis. Abbreviations used: e: eof-flag, s: same-flag, i: in-rec, o:
out-rec, i.t: in-rec.typ, i.s: in-rec.src, o.r: out-rec.rcv, o.a: out-rec.amt,
‘H’: ‘HDR’, ‘S’: ‘SAME’, ‘D’: ‘DIFF’, ‘I’: ‘ITM’, ‘T’: ‘TRL’.


We are now ready to present our transfer function $f_D^r$ for a READ node r. The
transfer function is:

$$f_D^r(d) = \lambda q_j \in Q.\ \bigsqcup_{(q_i \to q_j) \in \Delta} \{ f_L^r(d(q_i),\ label(S, q_i, q_j)) \}$$

where $label(S, q_i, q_j)$ returns the label (which is a type, or eof) of the
transition $q_i \to q_j$ in S. The intuition behind this transfer function is as follows. For
any file state $q_j$, a trace can be in $q_j$ after executing the READ if the trace is in
any one of the predecessor states of $q_j$ in the automaton just before executing
the READ. Therefore, the fact (from lattice L) that is to be associated with $q_j$
at the point after the READ statement can be obtained as follows: (1) For each
file state $q_i$ such that there is a transition $q_i \to q_j$ labeled s in the automaton,
transfer the fact $f_L^r(l_i, s)$, where $l_i \in L$ is the fact that $q_i$ is mapped to at the
point before the READ statement. (2) Take a join of all these transferred facts.
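
The lifted READ transfer function can be sketched in a few lines of Python. As before, this is an illustration under assumptions rather than the authors' implementation: the underlying analysis is a simple constant propagation with `None` as bottom, the transition list is reconstructed from the prose of the running example, the AT END clause is not modeled, and the names `join`, `read_L`, and `read_D` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Fact = Optional[Dict[str, str]]     # dict of known constants; None is bottom
Lifted = Dict[str, Fact]

FILE_STATES = ["qs", "qsh", "qdh", "qi", "qt", "qe"]
# Assumed transitions, as (q_i, label, q_j) triples.
DELTA: List[Tuple[str, str, str]] = [
    ("qs", "SHdr", "qsh"), ("qs", "DHdr", "qdh"),
    ("qsh", "Itm", "qi"), ("qdh", "Itm", "qi"),
    ("qi", "Itm", "qi"), ("qi", "Trl", "qt"),
    ("qt", "SHdr", "qsh"), ("qt", "DHdr", "qdh"),
    ("qt", "eof", "qe"),
]
TYPE_CONSTRAINTS: Dict[str, Dict[str, str]] = {
    "SHdr": {"in-rec.typ": "HDR", "in-rec.src": "SAME"},
    "DHdr": {"in-rec.typ": "HDR", "in-rec.src": "DIFF"},
    "Itm": {"in-rec.typ": "ITM"},
    "Trl": {"in-rec.typ": "TRL"},
}


def join(a: Fact, b: Fact) -> Fact:
    """CP join: keep only the constants on which both facts agree."""
    if a is None:
        return b
    if b is None:
        return a
    return {var: c for var, c in a.items() if b.get(var) == c}


def read_L(l1: Fact, label: str) -> Fact:
    """Underlying READ transfer function: forget the buffer, then apply the type."""
    if l1 is None:
        return None
    l2 = {var: c for var, c in l1.items() if not var.startswith("in-rec.")}
    if label != "eof":
        l2.update(TYPE_CONSTRAINTS[label])
    return l2


def read_D(d: Lifted) -> Lifted:
    """Lifted READ: each target state gets the join over its incoming transitions."""
    out: Lifted = {q: None for q in FILE_STATES}
    for q_i, label, q_j in DELTA:
        out[q_j] = join(out[q_j], read_L(d.get(q_i), label))
    return out


before: Lifted = {
    "qsh": {"in-rec.typ": "HDR", "in-rec.src": "SAME", "same-flag": "S"},
    "qdh": {"in-rec.typ": "HDR", "in-rec.src": "DIFF", "same-flag": "D"},
}
after = read_D(before)
print(after["qi"], after["qsh"])
```

In this example the fact under `qi` retains the constant for in-rec.typ but loses the constant for same-flag, because the two predecessor states disagree on it; `qsh` stays unreachable because neither of its predecessors is reachable before the READ.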

Figure 3 sketches this transfer function schematically. Each edge from a
column before the READ statement to a column after the READ denotes a “transfer”
that happens due to Step (1) above; the label on the edge denotes the label
on the corresponding transition in the automaton. We have abbreviated each
instance of $f_L^r$ in this figure as f. We have also omitted some of the columns for
compactness.

Our presentation above was limited to the intra-procedural setting. However,
our analysis can be extended to the inter-procedural setting using standard
techniques, some details of which we discuss later in the paper.










































Illustration. Figure 4 shows the fix-point solution at certain program points
for the example program in Figure 1(a) as computed by our analysis, using the
well-formed automaton shown in Figure 2(a). We assume an underlying lattice L that is
a product of the constant-propagation (CP) and possibly uninitialized variables
(Uninit) lattices. Each table in the figure denotes the solution (i.e., a function
from Q to L) at the program point that precedes the statement that is indicated
below the table. Each column of a table shows the underlying dataflow value
associated with a file state. Columns in which the underlying dataflow value is
$\bot$ (which represents unreachability, basically) are omitted from the tables for
brevity. The first component of each dataflow value, within angle brackets,
indicates the constant values of variables, while the second component, within
curly braces, indicates the set of variables that are possibly uninitialized. Empty
sets are omitted from the figure for brevity. We abbreviate the variable names as
well as constant values for the sake of compactness, as described in the caption
of the figure.

In the interest of space, we focus our attention on just one of the program
points: the point just before the IF condition in line 4. Any execution trace
reaching this point can be in any one of the following four file states: $q_{sh}$, $q_{dh}$, $q_i$,
or $q_t$. These file states have associated CP facts that indicate that in-rec.typ has
value ‘HDR’, ‘HDR’, ‘ITM’, and ‘TRL’, respectively. Additionally, the state variable
same-flag is possibly uninitialized under columns $q_{sh}$ and $q_{dh}$, because lines 14-
17 (which initialize this variable) may not have been visited yet, whereas it is
initialized under the other two columns. Now, only the fact associated with $q_i$
flows down the true branch of the conditional in line 4. This is because this
conditional tests that in-rec.typ contains ‘ITM’. Therefore, same-flag is inferred
to be definitely initialized by the time it is referenced in line 6, which is the
desired precise result.


3.2 Soundness, precision, and complexity of our approach 

Our soundness result, intuitively, is that if the underlying analysis U is sound, 
then so is our lifted analysis, modulo the assumption that any input file given 
to the program P conforms to the given input automaton S. U itself is said 
to be sound if at any program point p of any program P the fix-point solution 
I computed by U represents all concrete states that can result at p due to all 
possible execution traces of P that begin in any concrete state that is repre¬ 
sented by the given initial value i^- The following theorem states the soundness 
characterization of our analysis more formally. 

Theorem 1. Assuming U is a sound abstract interpretation, if d is the fix-point 
solution produced by our approach at a program point p of the given program P 
starting with initial value (psAl), then I = ⊔_{q ∈ Q} {d(q)} represents all concrete
states that can result at point p due to all possible executions that begin in any 
concrete state that is represented by i^ and that are on an input file that conforms 
to the given automaton S. 


The proof of the theorem above is straightforward. 

The following observations follow from the theorem above. (1) If the given 
input automaton is a well-formed automaton, then the fix-point solution at a 
program point p represents all concrete states that can result at point p during 
executions on well-formed files. (2) If the given input automaton is a “full” au¬ 
tomaton, then the fix-point solution at a program point p represents all concrete 
states that can result at point p during all possible executions (including on 
ill-formed files). 

Given any input automaton S, our approach will produce a fix-point solution
that is at least as precise as the one that would be produced directly by the
underlying analysis U. However, the choice of the input automaton does impact
the precision of our approach. Intuitively, if input automaton S1 is a sub-automaton
of input automaton S2 (i.e., S1 is obtained from S2 by deleting certain states
and transitions, or by further constraining some of the types that label some of
the transitions), then S1 will result in a more precise solution than S2. Also, if
S1 and S2 accept the same language, but S1 structurally refines S2, then S1 will
result in a more precise solution. We formalize these notions in the appendix.

The time complexity of our analysis when used with an automaton S that has 
a set of states Q is, in the worst case, |Q| times the worst-case time complexity
of the underlying analysis U. 

3.3 Two applications of our analysis 

In this section we describe how we use our analysis described in Section 3.1
to address the two new problems that we mentioned in the introduction - file 
format conformance checking, and program specialization. 

File format conformance checking. As mentioned in the introduction, a ver¬ 
ification question that developers would like an answer to is whether a program 
can silently “accept” an ill-formed input file and possibly write out a corrupted 
output file (“over acceptance”). Or, conversely, could the program “reject” a 
well-formed file via an abort or a warning message (“under acceptance”)? 

Different programs use different kinds of idioms to “reject” an input file; 
e.g., generating a warning message (and then continuing processing as usual), 
ignoring an erroneous part of the input file and processing the remaining records, 
and aborting the program via an exception. In order to target all these modes 
in a generic manner, our approach relies on the developer to identify file-format 
related rejection points in a program. These are the statements in a program 
where format violations are flagged, using warnings, aborts, etc. 

Detecting under-acceptance. We detect under-acceptance warnings by (1) applying
our analysis using a well-formed automaton and using any given program-state
abstraction domain (e.g., CP, or interval analysis) as U, and (2) issuing an
under-acceptance warning if the fact associated with any file state is non-⊥ at
any rejection point. The intuition is simply that rejection points should be
unreachable when the program is run on well-formed files. Since our analysis is
conservative, in that it never produces under-approximated dataflow facts, this
approach will not miss any under-acceptance issues as long as the developer does
not fail to mark any actual rejection point as a rejection point.

As an illustration, say line 23 in the program in Figure [T] is marked as a
rejection point. Using the well-formed automaton in Figure ^Fa) and using CP
as the underlying analysis U, our analysis will find this line to be unreachable. 
Therefore, no under-acceptance warnings will be issued. 
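The under-acceptance check described above amounts to a scan of the fix-point solution at the marked rejection points. The sketch below assumes a hypothetical representation in which the solution is a dictionary from program points to lifted facts, and ⊥ is encoded as None; the function name and the sample facts are made up for illustration.

```python
BOTTOM = None  # assumed encoding of the unreachable fact

def under_acceptance_warnings(solution, rejection_points):
    """Return (point, file_state) pairs whose fact is non-bottom at a rejection point."""
    warnings = []
    for point in rejection_points:
        # A point absent from the solution is treated as unreachable.
        for state, fact in solution.get(point, {}).items():
            if fact is not BOTTOM:
                warnings.append((point, state))
    return sorted(warnings)

# Made-up solution fragments for a rejection point at line 23.
clean = {23: {"qs": BOTTOM, "qsh": BOTTOM, "qi": BOTTOM}}
suspect = {23: {"qs": BOTTOM, "qsh": {"in-rec.typ": "HDR"}, "qi": BOTTOM}}

print(under_acceptance_warnings(clean, [23]))
print(under_acceptance_warnings(suspect, [23]))
```

The first call reports nothing, as in the illustration above; the second shows the shape of a warning when some file state reaches the rejection point.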

Detecting over-acceptance. Intuitively, a program has over acceptance errors 
with respect to a given well-formed file format if the program can reach the end 
of the main procedure without going through any rejection point when run on 
an input file that does not conform to the well-formed automaton. We check 
this property as follows: We first extend the given well-formed automaton S to 
a full automaton (which accepts all input files) systematically by adding a new 
final state qx, a few other new non-final states, and new transitions that lead to
these new states from the original states. The intent is for these new states to 
accept record sequences that are not accepted by any file state in the original 
automaton S. We provide the full details of this construction in the appendix. 
Secondly, we modify the transfer functions of our lifted analysis D at rejection
points such that they map all file states to ⊥ in their output. Intuitively, the
idea behind this is to “block” paths that go through rejection points.

We then apply our analysis using this full automaton and using any program-state
abstraction domain as U, and flag an over-acceptance warning if the
dataflow value associated with any file state that is not a final state in the
original well-formed automaton is non-⊥ at the final point of the “main” procedure.
Clearly, since our analysis over-approximates dataflow facts at all program
points, we will not miss any over-acceptance scenarios as long as the developer
does not wrongly mark a non-rejection point as a rejection point.
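The two ingredients of the over-acceptance check, blocking at rejection points and inspecting the fact at the end of the main procedure, can be sketched as below. The dictionary encoding, the None encoding of ⊥, the function names, and the state name qe (standing in for a final state of the well-formed automaton) are assumptions for this example; qx is the new final state mentioned in the text.

```python
BOTTOM = None  # assumed encoding of the unreachable fact

def block_at_rejection_point(lifted_fact):
    """Modified transfer function at a rejection point: every file state maps to bottom."""
    return {state: BOTTOM for state in lifted_fact}

def over_acceptance_warnings(fact_at_exit, wellformed_final_states):
    """File states that reach the end of main but are not final in the well-formed automaton."""
    return sorted(
        state
        for state, fact in fact_at_exit.items()
        if fact is not BOTTOM and state not in wellformed_final_states
    )

# Made-up fact at the final point of the main procedure.
exit_fact = {"qe": {"count": 1}, "qx": {"count": 2}, "qi": BOTTOM}

print(over_acceptance_warnings(exit_fact, {"qe"}))
print(over_acceptance_warnings(block_at_rejection_point(exit_fact), {"qe"}))
```

In this toy input the state qx is reachable at program exit, so an ill-formed file could be silently accepted; once every path is blocked by a rejection point, no warning remains.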


Program specialization based on file formats. As mentioned in the in¬ 
troduction, it would be natural for developers to want to specify specialization 
criteria for file-processing programs as patterns on sequences of record types in 
an input file. We propose the use of input automatons for this purpose. For 
example, if the well-formed automaton in Figure [^a) were to be modified by 
removing the file state qdh, as well as all transitions incident on it, what would 
be obtained would be a specialization automaton that accepts files in which all 
batches begin with “same” headers only. 

Our approach to program specialization using a specialization automaton is 
as follows. (1) We apply our analysis of Section 3.1 using the given specialization 
automaton as the input automaton, and using any program-state abstraction 
domain as U. (2) We identify program points p at which every file state is
mapped to ⊥ as per the fix-point solution computed in Step 1. Basically, these
program points are unreachable during executions on input files that conform 
to the specialization criterion. The statements/conditionals that immediately 
follow these points can be projected out of the program to yield the specialized 
program (the details of this projection operation are not a focus of this paper). 



It is easy to see that our approach is sound, in that it marks a point as 
unreachable only if it is definitely unreachable during all runs on input files that 
adhere to the given criterion. 
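Step 2 of the specialization procedure can be sketched as a simple filter over the fix-point solution. As before, the dictionary representation, the None encoding of ⊥, and the sample solution fragment are assumptions chosen only to illustrate the idea.

```python
BOTTOM = None  # assumed encoding of the unreachable fact

def unreachable_points(solution):
    """Program points at which every file state is mapped to bottom."""
    return sorted(
        point
        for point, lifted in solution.items()
        if all(fact is BOTTOM for fact in lifted.values())
    )

# Made-up solution fragment: points 9 and 17 are unreachable under the
# specialization automaton, point 8 is reachable in file state qsh.
solution = {
    8: {"qsh": {"in-rec.src": "SAME"}, "qi": BOTTOM},
    9: {"qsh": BOTTOM, "qi": BOTTOM},
    17: {"qsh": BOTTOM, "qi": BOTTOM},
}

print(unreachable_points(solution))
```

The statements that follow the reported points are the candidates for projection out of the specialized program.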

Illustration. Using the specialization automaton mentioned above, and using 
CP as the underlying analysis, lines 9 and 17 in the code in Figure [^a) are 
marked as unreachable. It is worth noting that if one did not use the specialization
automaton as the criterion, and instead simply specified that all header
records have value ‘SAME’ in their ‘src’ field, then line 9 would not be identified as
unreachable. Intuitively, this is because the path consisting of lines 1-6, along
which same-flag is uninitialized, would not be found infeasible, as discussed in
Section 1.2.

Subsequently, as a post-processing step (which is not a part of our core 
approach), the following further simplifications could be done to the program: 
(i) Make lines 7 and 15 unconditional, and remove the respective controlling
“if” conditions. This would be safe because the “else” branches of these two “if”
conditions have become empty. (ii) Remove line 15 entirely. This would be safe
because after the conditional in line 6 is removed the variable same-flag is not
used anywhere in the program.


3.4 Imposing data integrity constraints on input files 

The core of our approach, which was discussed in Section 3.1, used input automatons
that constrain the sequences of types of records that can appear in an
input file. However, in many situations, a well-formed file also needs to satisfy
certain data integrity constraints with respect to the contents of certain persistent tables. If
these constraints can also be specified in conjunction with the input automaton, 
then certain paths in the program that execute only upon the violation of these 
constraints can be identified and pruned out during analysis time. This has the 
potential to further improve the precision and usefulness of our approach. 

In our running example, say there is a requirement that the “receiver” of 
any payment in the input file (represented by the in-rec.rcv field in the item 
record) necessarily be an account holder in the bank. Such a requirement could 
be enforced in the code in Figurej^by adding logic right after line 4 to check if the 
value in in-rec.rcv appears as a primary key in the “accounts” database table 
of the bank, and to not execute lines 5-11 if the check fails. However, if a user of 
our approach wishes to assert that input files will never contain items that refer 
to invalid account numbers, then the logic mentioned above could be identified 
as redundant. To enable users to give such specifications we allow predicates
of the form isInTable(Tab, field) and isNotInTable(Tab, field) to be associated
with record type definitions, where Tab is the name of a persistent table, and
field is the name of a field in the record type. For example, the Itm type in
Figure [^b) could be augmented as “typ = ‘ITM’ ∧ isInTable(accounts, rcv)”,
where accounts is the master accounts table. The semantics of this is that the
value in the rcv field is guaranteed to be a primary key of some row in the
table accounts. Similarly, isNotInTable(Tab, field) asserts that the value in field
is guaranteed not to be a primary key of any row in Tab.

We assume our programming language has the following construct for key- 
based lookup into a table Tab: 

READ Tab INTO buffer KEY variable, INVALID KEY statements-N statements-F

The semantics of this statement is as follows: If a table row with a key matching
the value in variable is found in the table Tab, then it is copied into buffer
and control is given to statements-F. If no matching key is found, then the buffer
content is undefined and control is given to statements-N.

With this enhancement of record type specifications, we extend our analysis
framework as follows. The new lattice we use will be D = Q → (S × L), where
L is the given original underlying lattice, S = 2^C, and C is the set of all possible
predicates of the two kinds mentioned above.

We now describe the changes required to the transfer functions. The transfer
function of (normal) READ statements that read from input files, which we described
in Section 3.1, is to be augmented as follows. Whenever a record of a certain
type t ∈ T is read in, any predicates in the incoming fact that refer to fields of
the input buffer are removed, and the predicates associated with t are included
in the outgoing fact.

The transfer function of the statement “MOVE X TO Y” copies to the outgoing 
fact all predicates in the incoming fact that refer to variables other than Y. 
Further, for each predicate in the incoming fact that refers to X, it creates a copy 
of this predicate, makes it refer to Y instead of X, and adds it to the outgoing 
fact. 

Transfer functions of conditionals do not need any change. 

Finally, we need to handle key-based lookups, which is the most interesting 
case. Consider once again the statement: 

READ T INTO buffer KEY v, INVALID KEY statements-N statements-F 

The transfer function first checks if a predicate of the form isInTable(T, v) is
present in the incoming dataflow fact. If it is, it essentially treats statements-N
as unreachable. Else, if a predicate of the form isNotInTable(T, v) is present in
the incoming fact, it essentially treats statements-F as unreachable. Otherwise,
it treats both statements-F and statements-N as reachable.

A more formal presentation of these transfer functions is omitted from this 
paper in the interest of space. 


4 The Program File State Graph (PFSG) 


In this section we introduce our program representation for file-processing pro¬ 
grams, the Program File State Graph (PFSG). We then formalize the properties 
of the PFSG. Finally, we discuss how the PFSG serves as a basis for performing 
other program analyses without any modifications, while enabling them to ignore 
certain CFG paths that are infeasible as per the given input automaton. 


4.1 Structure and construction of the PFSG 


The PFSG is a representation that is based on a CFG G of a file-processing 
program P as well as on a given input automaton S for P. 

The PFSG is basically an exploded CFG. If the set of states in S (i.e., file
states) is qs, q1, q2, ..., qe, then, for each node m in the CFG, we have nodes
(m, qs), (m, q1), (m, q2), ..., (m, qe) in the PFSG. In other words, the PFSG has
N|Q| nodes, where N is the number of nodes in G and Q is the set of file-states of
S. A structural property of the PFSG is that an edge is present between nodes
(m, qi) and (n, qj) in the PFSG only if there is an edge from m to n in the
CFG. In other words, any path (m, qi) → (n, qj) → ... → (r, qk) in the PFSG
corresponds to a path m → n → ... → r in the CFG. Let sG be the entry node
of the CFG. The node (sG, qs) is regarded as the entry node of the PFSG.

The PFSG can be constructed in a straightforward manner using our basic
approach that was described in Section 3.1. The precision of the PFSG is linked
to the precision of the underlying analysis U that is selected. For example, the
standard CP (constant-propagation) analysis could be used as U. If more precision
is required, then a more powerful lattice, for instance a relational domain
(wherein each lattice value represents a set of possible valuations of variables)
such as the Octagon domain [2S], could be used. Once the fix-point solution is
obtained from the approach, edges are added to the PFSG as per the following
procedure.

For each edge m → n in the CFG:

Rule 1, applicable if m is a “read” node: For each transition qi → qj in the
automaton S, add an edge (m, qi) → (n, qj) in the PFSG.

Rule 2, applicable if m is not a “read” node: For each q ∈ Q add an edge
(m, q) → (n, q) in the PFSG.

In both the rules above we add an edge (m, qk) → (n, ql) only if dm(qk) ≠ ⊥
and dn(ql) ≠ ⊥, where dm and dn are the fix-point solutions at m and n,
respectively. We follow this restriction because dm(qk) (resp. dn(ql)) is ⊥ only
when there is no execution trace that can reach m (resp. n) due to a sequence
of records that is accepted by file state qk (resp. ql).

The intuition behind the first rule above is that when a “read” statement
executes, it modifies the file-state that the program could “be in”; intuitively, this
is a file-state that the input automaton could be in were we to start simulating
the automaton from qs when the program starts executing, and transition to
an appropriate target state upon the execution of each “read” statement based
on the type of the record read. When executing a “read” statement a program
could transition from a file-state qi to a file-state qj only if such a transition is
present in the input automaton.

The second rule above does not “switch” the file-state of the program, because
statements other than “read” statements affect the program’s internal state (i.e.,
valuation of variables) but not the file-state that the program is in.
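The two edge-addition rules, together with the ⊥ restriction, can be sketched as follows. This is a simplified illustration under stated assumptions: CFG nodes are line numbers, automaton transitions are given as (source, target) pairs without their type labels, and the fix-point solution is a dictionary keyed by (node, file state) in which a missing entry or None stands for ⊥. The function name pfsg_edges and the toy inputs are made up.

```python
def pfsg_edges(cfg_edges, read_nodes, states, transitions, solution):
    """Build the PFSG edge set from a CFG, an automaton and a fix-point solution."""

    def live(node, state):
        # A missing or None entry is treated as bottom (unreachable).
        return solution.get((node, state)) is not None

    edges = set()
    for m, n in cfg_edges:
        if m in read_nodes:
            pairs = transitions                  # Rule 1: follow automaton transitions
        else:
            pairs = [(q, q) for q in states]     # Rule 2: file state is unchanged
        for qi, qj in pairs:
            if live(m, qi) and live(n, qj):
                edges.add(((m, qi), (n, qj)))
    return edges

# Toy fragment: line 1 is OPEN, line 2 is a READ, line 3 follows the READ.
cfg_edges = [(1, 2), (2, 3)]
states = ["qs", "qsh", "qdh"]
transitions = [("qs", "qsh"), ("qs", "qdh")]
solution = {(1, "qs"): "fact", (2, "qs"): "fact",
            (3, "qsh"): "fact", (3, "qdh"): "fact"}

for edge in sorted(pfsg_edges(cfg_edges, {2}, states, transitions, solution)):
    print(edge)
```

On this fragment the sketch yields one edge within the qs column and two edges that leave the READ node for the qsh and qdh columns.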



Fig. 5. PFSG for program in Fig. [^a) and input automaton in Fig. 


4.2 Illustration of PFSG 


For illustration, consider the PFSG shown in Figure 5, corresponding to the program
in Figure [^a) and input automaton in Figure. Recall that this automaton
describes all well-formed files where headers, item records, and trailers appear
in their correct positions. Visually, the figure is laid out in six columns, corresponding
to the six file states in the input automaton. The nodes in the PFSG
are labeled with the corresponding line numbers from the program. Therefore,
e.g., the node labeled /1/ in the qs column is actually node (/1/, qs), where /1/
represents the OPEN statement in line 1 of the program. On a related note, qs
being the start state of the input automaton, and line 1 being the entry node of
the program, the node mentioned above is in fact the entry node of the PFSG.
Certain parts of the PFSG are elided for brevity, and are represented using
the cloud patterns. This PFSG was generated using a fix-point solution from
our approach of Section 3.1 using CP (constant propagation) as the underlying
analysis U. A fragment of this fix-point solution was shown in Figure]^ (the sets
of possibly uninitialized variables in that figure can be ignored in the current
context).

We now discuss in more detail a portion of the PFSG in Figure 5, with emphasis
on how it elides certain infeasible paths that are present in the original
CFG. Line 2 in the program is a READ statement. As per the given input automaton
the outgoing transitions from qs go to qsh and to qdh. Therefore, as per
Rule 1 of our PFSG edge-addition approach (see Section 4.1 above), there are
outgoing edges from (/2/, qs) to copies of node 3 in the qsh and qdh columns
(for clarity we have labeled these edges with the types on the corresponding
input-automaton transitions). The qsh column (i.e., the second column) essentially
consists of a copy of the loop body, specialized to the situation wherein
the last record read was of type SHdr (SHdr being the type on all transitions
coming into qsh). In particular, note that the true edge out of /4/ to /5/ in the
qsh column is elided. This is because in the fix-point solution (see Figure]^) the
CP (constant propagation) fact associated with the qsh file state at the point
before /4/ indicates that in-rec.typ has value ‘HDR’ (this fact is abbreviated as
“i.t = ‘H’” in the figure). Therefore, in the fix-point solution, the underlying
fact associated with the qsh file state out of this edge ends up being ⊥, which
results in Rule 2 adding only the false edge from /4/ to /12/.

Line 25 in the program being a READ statement, there is an edge from node
(/25/, qsh) at the bottom of the qsh column to the entry of column qi (qi being
the sole successor of qsh in the input automaton). The qi column consists of a
copy of the loop body, specialized to the situation wherein the previous record 
read is of type Itm (this being the type on transitions coming into qi). From the 
end of the qi column control goes to the qt column, and from the end of that 
column back to the beginning of the qsh and qdh columns. 

It is notable that the structure of the PFSG is inherited both from the CFG 
and from the input automaton. As was mentioned in the discussion above, con¬ 
trol transfers from one column to another mirror the transitions in the input 
automaton, while paths within a column are inherited from the CFG, but specialized
with respect to the type of record that was last read.

4.3 Program analysis using PFSG 

Any program analysis that can be performed using a CFG can naturally be
performed unmodified using a PFSG, by simply letting the analysis treat each 
node (n, qi) as being the same statement/conditional as the underlying node n. 

Such an analysis will be no less precise than with the original CFG of the 
program. This is because, by construction, every path in the PFSG corresponds 
to a path in the original CFG; in other words, there are no “extra” paths in the 
PFSG. To the contrary, certain CFG paths that are infeasible as per the given 
input automaton could be omitted from the PFSG. In other words, precision of 
the analysis is improved by ignoring executions due to certain infeasible inputs. 
For instance, in the example that was discussed above, due to the omitted edge 
from /4/ to /5/ in the qsh column, there is no path in the PFSG that visits 
copies of the nodes /1/, /2/, /3/, /4/, /5/, and /6/, in that order, even though
such a path exists in the original CFG. In other words, the PFSG encodes the 
fact that under the given input file format an “item” record cannot occur as the 
first record in an input file. (However, in general, due to possible imprecision in 
the given underlying analysis U, not all paths that are infeasible as per the given 
input automaton would necessarily be excluded from the PFSG.) 

To illustrate the benefits of program analysis using the PFSG, we discuss 
two example analyses below: 

— Say we wish to perform possibly uninitialized variables analysis. Due to the
path /1/-/2/-/3/-/4/-/5/-/6/ in the original CFG, the use of same-flag in
line /6/ would be declared as possibly uninitialized. However, under the given
input automaton, since every path that reaches line /6/ in the PFSG reaches
it via lines /15/ or /17/ (which both define same-flag), the use mentioned
above would be declared as definitely initialized when the possibly-uninitialized
analysis is performed on the PFSG.

— A CP (constant-propagation) analysis, when done on the original CFG of the
program, would indicate that at the point before line /25/ same-flag would
not have a constant value. However, when the same analysis is done on the
PFSG, it would indicate that (a) if in-rec.typ is ‘HDR’ and
in-rec.src is ‘SAME’ then same-flag has value ‘S’, (b) if in-rec.typ is ‘HDR’
and in-rec.src is ‘DIFF’ then same-flag has value ‘D’, and (c) same-flag is not
a constant otherwise. These correlations are identified because the PFSG is
“exploded”, hence segregates CFG paths that end at the same program point
but are due to record sequences that are accepted by different file-states of
the input automaton. Correlations such as the ones mentioned above cannot
be identified, in general, using the CFG unless very expensive domains (such
as relational domains) are used.
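The second example can be made concrete with a small sketch of the constant-propagation join. The representation is an assumption made for illustration: a CP fact is a dictionary from variables to constants (a missing variable is not a constant), None stands for ⊥, and the per-file-state facts before line /25/ are simplified to mention only same-flag.

```python
from functools import reduce

BOTTOM = None  # assumed encoding of the unreachable fact

def join_cp(a, b):
    """Join of two constant-propagation facts."""
    if a is BOTTOM:
        return b
    if b is BOTTOM:
        return a
    # Keep only the variables that hold the same constant in both facts.
    return {var: c for var, c in a.items() if var in b and b[var] == c}

# Simplified facts before line /25/, one per copy of the node in the PFSG.
per_state = {
    "qsh": {"same-flag": "S"},
    "qdh": {"same-flag": "D"},
    "qs": BOTTOM,
}

merged = reduce(join_cp, per_state.values(), BOTTOM)
print(per_state["qsh"], per_state["qdh"], merged)
```

Joining the per-state facts, which is what an analysis on the plain CFG effectively does at a merge point, loses the constant value of same-flag, whereas the PFSG keeps the two facts on separate node copies.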

The two instances of precision improvement mentioned above can also be
obtained using our approach of Section 3.1 if we use CP × Uninit as the underlying
domain U (where Uninit is the possibly-uninitialized analysis) for the first
instance, and if we simply use CP as the underlying domain U for the second
instance. However, in general, there are several scenarios where the PFSG serves
better as a foundation for performing program analysis than the approach of
Section 3.1:

— The approach of Section 3.1 applies only to forward dataflow analysis,
whereas the PFSG can be used for forward as well as backward dataflow
analysis problems.

— The PFSG can serve as a basis for applying static analysis techniques other than
dataflow analysis, such as symbolic execution, model-checking, assertional 
reasoning, etc. Implementations of these techniques that are designed for 
CFGs can be applied unmodified on the PFSG. All these analyses are likely 
to benefit from the pruning of paths from the PFSG that are infeasible as 
per the given input automaton. 


4.4 Formal properties of the PFSG 

Soundness. We now characterize the paths in the original CFG that are 
necessarily present in the PFSG. This result forms the basis for the soundness 
of any static analysis that is applied on the PFSG. 

Theorem: Let U = ((L, ⊔_L), F_L) be a given underlying sound [S] abstract
interpretation. Consider any execution trace t of the program P that begins with
a concrete state that is represented by the given initial dataflow fact (an element of L). Let
I be the sequence of records due to which t executes, and T be the number of
nodes in ns(t).

If

(a) I is accepted by some non-final file-state q of S, and t did not encounter
end-of-file upon a read, or

(b) I is accepted by some final file-state q of S, and the last “read” in t
encountered end-of-file,

then there is a path t' in the PFSG such that

(a) the first node of t' is (sG, qs),

(b) for all i ∈ [2, T], if the ith node of ns(t) is some node m then the ith
node of t' is of the form (m, qj) for some file-state qj, and

(c) the last node of t' is of the form (m, q), for some m. □

Intuitively, the theorem above states that for all execution traces that are due 
to record sequences that are accepted by the given input automaton, control-flow 
paths taken by these traces are present in the PFSG. 

In the specific scenario where the PFSG is used to perform a dataflow
analysis, the theorem above can be instantiated as follows.

Corollary: Let D be any sound dataflow analysis framework [^, based on
a join semi-lattice. Let dp be a given dataflow fact at program entry (dp is an
element of D’s lattice). For any node n of the original CFG G, let s(n) denote
the final fix-point solution at n computed using D on the CFG using initial value
dp. Consider a PFSG for G obtained using a given input automaton S. For any
node (n, qi) of the PFSG, let s(n, qi) denote the final fix-point solution at (n, qi)
computed by D, but applied on the PFSG, using the same initial value dp.

Let s'(n) = ⊔_{qi is a file-state of S} {s(n, qi)}.

(a) s'(n) ⊑ s(n). [Precision]

(b) γ(s'(n)) over-approximates the set of concrete states that can arise
at node n when the program is run on input files that conform to S. [Soundness] □


Precision ordering among PFSGs. As was clear from the discussion in this 
section, the PFSG produced by our approach for a given CFG G and input 
automaton S is not fixed, but depends on the selected underlying abstract in¬ 
terpretation U. The theorem given above states that no matter what abstract 
interpretation is used as U, the PFSG is sound (i.e., does not elide any paths 
that can be executed due to record sequences that are accepted by S) as long as 
U is sound. However, the precision of the PFSG depends on the precision of U.

Given a CFG G and an input automaton S, we can define a precision
ordering on the set of PFSGs for G and S that can be obtained using different
(sound) underlying domains U1, U2, etc. A PFSG P1 can be said to be at least
as precise as another PFSG P2 if every edge (m, qi) → (n, qj) in P1 is also
present in P2. (Note that this implies that every path in P1 is also present in P2.)

Theorem: If an underlying domain U2 is a consistent abstraction of another
underlying domain U1, then the PFSG obtained for G and S using U1 is at least
as precise as the PFSG obtained for G and S using U2. □
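The precision ordering defined above is an edge-set inclusion, which the following sketch makes explicit. The representation of a PFSG as a set of ((node, state), (node, state)) pairs and the two toy edge sets are assumptions for illustration only.

```python
def at_least_as_precise(p1_edges, p2_edges):
    """P1 is at least as precise as P2 if every edge of P1 is also in P2."""
    return set(p1_edges) <= set(p2_edges)

# Toy PFSGs for the same CFG and automaton: the coarser one also contains
# the true edge out of /4/ in the qsh column.
coarse = {((4, "qsh"), (5, "qsh")), ((4, "qsh"), (12, "qsh"))}
fine = {((4, "qsh"), (12, "qsh"))}

print(at_least_as_precise(fine, coarse), at_least_as_precise(coarse, fine))
```

Here the finer PFSG omits an edge that the coarser one retains, so the ordering holds in one direction only.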


5 Implementation and evaluation 


Prog.      LoC     No. of CFG   Well-formed automaton   Full automaton
name               nodes        |Q|       |Δ|           |Q|       |Δ|

ACCTRAN    155     73           4         8             5         15
SEQ2000    219     115          5         15            6         25
DTAP       632     275          5         6             10        41
CLIEOPP    1421    900          21        48            -         -
PROG1      1177    762          8         16            12        47
PROG2      1052    724          6         11            12        49
PROG3      2780    1178         17        28            20        50
PROG4      49846   32258        13        34            -         -

Fig. 6. Benchmark program details


We have targeted our implementation at Cobol batch programs. These are 
very prevalent in large enterprises [^, and are based on a variety of standard 
as well as proprietary file formats. Another motivating factor for this choice is 
that one of the authors of this paper has extensive professional experience with 
developing and maintaining Cobol batch applications. We have implemented 
our analysis using a proprietary program analysis framework Prism [23]. Our
implementation is in Java. We use the call strings approach m for precise 
context-sensitive inter-procedural analysis. Cobol programs do not use recursion; 
therefore, we place no a priori bound on call-string lengths.

Our analysis code primarily consists of an implementation of our generic analysis
framework, as described in Section 3.1. We have currently not implemented
our extension for data integrity constraints that was described in Section 3.4,
nor have we implemented our PFSG construction approach (Section 4). We also
have some lightweight scripts that process the fix-point solution emitted by the 
analysis to compute results for the specialization problem as well as the file 
conformance problem (see Section 3.3). 

We ran our tool on a laptop with an Intel i7 2.8 GHz CPU with 4 GB RAM. 


5.1 Benchmark programs 

We have used a set of eight programs as benchmarks for evaluation. Figure 6
lists key statistics about these programs. The second and third columns give the 
sizes of these programs, in terms of lines of code (including variable declarations) 
and in terms of number of (executable) nodes in the CFG (as constructed by
Prism). The program ACCTRAN is a toy program that was used as a running
example in a previous paper |81j . SEQ2000 is an example inventory management 
program used in a textbook m to showcase a typical sequential file processing 
program. The program DTAP has been developed by the authors of this paper. 
It is a payments validation program. The file-format it uses and the validation 
rules it implements are both taken from a widely used standard specification m- 
The program CLIEOPP is a payment validation and transformation program.
It was developed by a professional developer at a large IT consulting services
company for training purposes. The format and the validation rules it uses are
from another standard specification [T^]. PROG1 and PROG2 are real-world
programs used in a bank for validating and reporting “return” payments sent
from branches of the bank to the head-office. PROG3 and PROG4 are real-world
programs from major multinational financial services companies. The program
PROG3 is a format translator, which translates various kinds of input records
to corresponding output records. PROG4 reads data from a sequential master
file, collects the data required for computing monthly interest and fee for each
account, and writes this data out to various output files. The file formats used
in these four real-world programs are proprietary.

Columns 4 and 5 in Figure 6 give statistics about the well-formed automaton
for each program. For the programs ACCTRAN and SEQ2000, the respective
original sources of these programs also give the expected input file formats. For
the real programs PROG1, PROG2, and PROG3, we derived the record types as
well as the well-formed automatons by reading the programs and inferring the
intended formats of their input files. For program PROG4, the
maintainers provided us with the file format specification. In the case of programs
DTAP and CLIEOPP, we constructed the record types as well as the well-formed
automatons from their respective standard input-file specifications. In all cases
we employed a precision-enhancing rule of thumb while creating the automatons,
namely, that all incoming transitions into a file state be labeled with the same
type.
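
The rule of thumb above can be checked mechanically. The following is a minimal sketch, assuming an automaton is given as a list of (source, label, target) triples; this encoding, the function name, and the sample batch format are hypothetical and are not part of the implementation described in this paper.

```python
from collections import defaultdict


def states_violating_thumb_rule(transitions):
    """Return the file states whose incoming transitions carry more than one label.

    `transitions` is an iterable of (source_state, label, target_state) triples;
    this encoding is an assumption made for illustration.
    """
    incoming_labels = defaultdict(set)
    for _source, label, target in transitions:
        incoming_labels[target].add(label)
    return {state for state, labels in incoming_labels.items() if len(labels) > 1}


# Hypothetical batch format: a header, then items, then a trailer.
well_formed = [
    ("q_start", "Header", "q_header"),
    ("q_header", "Item", "q_item"),
    ("q_item", "Item", "q_item"),
    ("q_item", "Trailer", "q_trailer"),
]
print(states_violating_thumb_rule(well_formed))  # prints set()
```

An empty result means every file state is entered under a single record type, which is the property the rule asks for.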

For most of the programs we also constructed a full automaton, to use in the
context of “over acceptance” analysis. We created each full automaton using the
corresponding well-formed automaton as a basis, following the basic procedure
for extending a well-formed automaton described in Appendix A.2. Statistics
about these full automatons are presented
in the last two columns in Figure 6. We did not create a full automaton for
CLIEOPP and PROG4, because the full automatons for these programs turn out
to be large and unwieldy to specify. Instead, we used the well-formed automatons
in place of the full automatons in over-acceptance analysis, which can cause
unsoundness.

We evaluate our approach in three different contexts - its effectiveness in 
detecting file-format conformance violations in programs, its usefulness in spe¬ 
cializing file-processing programs, and its ability to improve the precision of a 
standard dataflow analysis. 


5.2 File format conformance checking 

As a first step in this experiment, we manually identified the rejection points
for each program. This was a non-trivial task, because each program
had its own idioms for rejecting files. Some programs wrote warning messages
into log files, others used system routines for terminating the program, while
others used Cobol keywords such as GOBACK and STOP RUN. Furthermore, since
not every instance of a warning output or termination is necessarily due to file
format related issues, we had to exercise care in selecting the instances that were
due to these issues. We also manually added summary functions in our analysis
for calls to certain system routines that terminate the program: our summary
functions treat these calls as returning a ⊥ dataflow value for all file states, thus
simulating termination. In this experiment we use CP (Constant Propagation)
as the underlying analysis U for our lifted approach.

Prog.       File format conformance warnings
Name        Under acceptance    Over acceptance
------------------------------------------------
ACCTRAN      0                    1
SEQ2000      3                    1
DTAP         0                    1
CLIEOPP     13                  * 0
PROG1        5                    9
PROG2        6                   10
PROG3        0                    1
PROG4        0                 * 10

Fig. 7. Conformance checking results

Figure 7 summarizes the results of this analysis. For each program, the second
column gives the number of instances of a file state of the well-formed
automaton having a non-⊥ value at a rejection point. These are the
under-acceptance warnings. The third column gives the number of file states
of the full automaton (excluding the final states of the original well-formed
automaton) that reach the final point of the “main” procedure with a non-⊥ value.
These are the over-acceptance warnings.
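
The two counts described above can be read off a lifted fix-point solution as sketched below. This is a minimal illustration, assuming the solution is available as a nested dictionary from program points to file states to dataflow values, with None standing for ⊥. The function names and the sample data are hypothetical and are not taken from our implementation.

```python
BOTTOM = None  # assumption: None encodes the bottom (unreachable) dataflow value


def under_acceptance_warnings(solution, rejection_points, well_formed_states):
    """List (rejection point, file state) pairs whose dataflow value is non-bottom.

    `solution` is the fix-point computed with the well-formed automaton.
    """
    return [
        (point, state)
        for point in rejection_points
        for state in well_formed_states
        if solution[point].get(state, BOTTOM) is not BOTTOM
    ]


def over_acceptance_warnings(solution, final_point, full_states, well_formed_finals):
    """List file states of the full automaton, other than the final states of the
    well-formed automaton, that reach the final point of "main" with a
    non-bottom value.

    `solution` is the fix-point computed with the full automaton.
    """
    return [
        state
        for state in full_states
        if state not in well_formed_finals
        and solution[final_point].get(state, BOTTOM) is not BOTTOM
    ]


# Made-up data for illustration only.
wf_solution = {"reject_1": {"q_header": {"x": 1}, "q_item": BOTTOM}}
full_solution = {"main_exit": {"q_end": {"x": 0}, "q_x": {"x": 0}, "q_y": BOTTOM}}

print(under_acceptance_warnings(wf_solution, ["reject_1"], ["q_header", "q_item"]))
print(over_acceptance_warnings(full_solution, "main_exit", ["q_end", "q_x", "q_y"], {"q_end"}))
```

The length of each returned list corresponds to one cell of the second or third column of the results table.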

The running time of the analysis was a few seconds or less on all programs 
except PROG4. On this very large program the analysis took 3700 seconds. 


Discussion of under-acceptance results. A noteworthy aspect of these results
is that four of the eight programs, namely, ACCTRAN, DTAP, PROG3, and
PROG4, have been verified as having no under-acceptance errors. In the case of
CLIEOPP, some of the under-acceptance warnings turned out to be true positives
during manual examination, in that the code contained programming errors that
cause rejection of well-formed files.

We also manually examined one other program for which there were
warnings, SEQ2000. Although this is a textbook program, it follows a
complex idiom. Certain fields in certain record types in the input file format for
this program are supposed to contain values that appear as primary keys in a
sorted persistent table that is accessed by the program. However, the well-formed
automaton that we created does not capture this constraint, and is hence
over-approximated. (This program uses sequential lookup on the persistent table,
which is an idiom that our persistent-stores extension (Section 3.4) does not
support.) This caused false under-acceptance warnings to be reported.


Discussion of over-acceptance results. As is clear from the table, our
implementation reports over-acceptance warnings on all the programs. (The numbers
marked with a * are potentially lower than they should really be, because,
as mentioned in Section 5.1, we did not actually use a full automaton for these
two programs.) We manually examined four of these programs, and report our
findings below.

Warnings reported for two of the programs - DTAP and PROG4 - turned
out to be genuine. The input file-format for DTAP is similar to the one shown in
Figurej^a) (the difference is that it uses single state qt in place of Qsh and qdh)- 
This program happens to accept files that contain batches in which a header 
record and a trailer record occur back-to-back without any intervening item 
records, which is a violation of the specification. In the case of PROG4, when we 
discussed the warnings with the maintainers of the program, they agreed that 
some of them were genuine. However, at present, there is another program that 
runs before PROG4 in their standard workflow that ensures that ill-formed files 
are not supplied to PROG4. 

In the case of SEQ2000 and PROG3, the well-formed automatons were
over-approximated. There is one other challenging idiom in SEQ2000, which also
contributes to imprecision. Some of the routines that emit warnings emit
file-conformance warnings when called from certain call-sites, and other kinds of
warnings when called from other call-sites. Since we currently do not have a
context-sensitive scheme to mark rejection points, we left these routines
unmarked as a conservative gesture.

5.3 Program specialization 


S. No  Program   Criterion     Criterion-specific   Common
       name      name          nodes                nodes
-----------------------------------------------------------
 1     ACCTRAN   Deposit            1                   41
 2     ACCTRAN   Withdraw          30                   41
 3     SEQ2000   Add               14                   84
 4     SEQ2000   Change            14                   84
 5     SEQ2000   Delete             6                   84
 6     DTAP      DDBank             5                  216
 7     DTAP      DDCust             7                  216
 8     DTAP      CTBank             5                  216
 9     DTAP      CTCust             7                  216
10     CLIEOPP   Payments          22                  622
11     CLIEOPP   DirectDebit       98                  622
12     PROG1     Edit              17                  511
13     PROG1     Update           215                  511
14     PROG2     Form               3                  644
15     PROG2     Telex              5                  644
16     PROG2     Modified           5                  644
17     PROG3     TranCopy          37                  671
18     PROG4     DAccts             2                30825
19     PROG4     MAccts          1047                30825

Fig. 8. Specialization criteria and results


The objective of this experiment is to evaluate the effectiveness of our
approach in identifying program statements that are relevant to given criteria that
are specified as specialization automatons. In this experiment we used CP as the
underlying analysis U. We ran our tool multiple times on each program, each
time with a different specialization criterion that we identified as representing
a meaningful functionality from the end-user perspective. For instance, consider
the program SEQ2000. The input file to this program consists of a sequence of
request records, with each request being either to “Add” an item to the inventory
(which is stored in a persistent table), to “Change” the details of an item in the
inventory, or to “Delete” an item from the inventory. A meaningful criterion for
this program would be one that is concerned only with “Add” requests.
Similarly, “Change” and “Delete” are meaningful criteria. Figure 8 summarizes the
results from this experiment. Each row in the figure corresponds to a
program-criterion pair. The third column in the figure indicates the mnemonic name
that we have given to each of our criteria. Note that for PROG3, the criterion
TranCopy specializes the program to process one of twelve kinds of input record
types. While we have done the specialization with all 12 criteria, for brevity we
report only one of them in the figure (i.e., TranCopy).


Results. For any criterion, the sum of the numbers in the fourth and fifth columns
in the figure is the number of CFG nodes that were determined by our analysis
as being relevant to the criterion (i.e., were reached with a non-⊥ value under
some file state with the specialization automaton). For instance, for
ACCTRAN-Deposit, the number of relevant CFG nodes is 42 (out of a total of 73 nodes in
the program; see Figure 6). The fifth column indicates the number of (common)
nodes that were relevant to all of the criteria supplied, while the fourth column
indicates the number of nodes that were relevant to the corresponding individual
criterion but are not common to all the criteria. Note that in the case of DTAP
we show commonality not across all four criteria, but within two subgroups each
of which contains two (related) criteria. Also, in the case of PROG3 the common
nodes depicted are across all twelve criteria.
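
The relationship between the two columns can be sketched as a set computation. This is a minimal illustration, assuming the relevant CFG nodes for each criterion are available as sets; the function name and the node identifiers are made up and are not data from our benchmarks.

```python
def split_relevant_nodes(relevant_by_criterion):
    """Split per-criterion relevant CFG nodes into common and criterion-specific sets.

    `relevant_by_criterion` maps a criterion name to the set of CFG nodes
    reached with a non-bottom value under that criterion.
    """
    node_sets = list(relevant_by_criterion.values())
    common = set.intersection(*node_sets)
    specific = {name: nodes - common for name, nodes in relevant_by_criterion.items()}
    return common, specific


# Made-up node identifiers, for illustration only.
relevant = {
    "Deposit": {1, 2, 3, 4},
    "Withdraw": {1, 2, 3, 5, 6},
}
common, specific = split_relevant_nodes(relevant)
for name, nodes in specific.items():
    # Relevant nodes for a criterion = criterion-specific nodes + common nodes.
    assert len(nodes) + len(common) == len(relevant[name])
print(len(common), {name: len(nodes) for name, nodes in specific.items()})
# prints: 3 {'Deposit': 1, 'Withdraw': 2}
```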

It is notable that in most of the programs the commonality among the state¬ 
ments that are relevant to the different criteria is high, while statements that 
are specific to individual criteria are fewer in number. Our belief is that in a 
program comprehension setting the ability of a developer to separately view 
common code and criterion-specific code would let them appreciate in a better 
way the processing logic that underlies each of these criteria. 


Manual examination. We manually examined the output of the tool to determine
its precision. We did this for all programs except PROG1, PROG2, and PROG4,
whose intricate control-flow and logic made manual evaluation
difficult. To our surprise, the tool was 100% precise on every criterion for four of
the remaining programs - ACCTRAN, CLIEOPP, DTAP, and PROG3. That is, it
did not fail to mark as unreachable any CFG node that was actually unreachable
(as per our human judgment) during executions on input files that conformed to
the given specialization automaton. This is evidence that specialization
automatons (in conjunction with CP as the underlying analysis) are a sufficiently
precise mechanism to specialize file-processing programs.


The one remaining program is SEQ2000, for which, as discussed in
Section 5.2, we have an over-approximated input automaton. Although the
specialized program does contain extra statements that should ideally be removed, the
result actually turns out to be 100% precise relative to the given automaton.


5.4 Precision improvement of existing analyses 


As discussed in Section 1.2 there are scenarios where one is interested in perform¬ 
ing standard analyses on a program, but restricted to paths that can be taken 
during runs on well-formed files only. To evaluate this scenario we implemented 
two analyses. One is a possibly uninitialized variables analysis, whose abstract 
domain we call Uninit, wherein one wishes to locate references to variables that 
have either not been initialized, or have been initialized using computations that 
in turn refer to possibly uninitialized variables. The second is a reaching defi¬ 
nitions analysis, whose abstract domain we call RD. We ran each of these two 
analyses in two modes: a “direct” mode, where the analysis is run as-is, and a 
“lifted” mode, where the analysis is done by lifting it with a well-formed au¬ 
tomaton. In the lifted mode, for Uninit we used CP x Uninit as the underlying 
analysis U, while for RD we used CP x RD as the underlying analysis. (The CP 
component is required to enable path-sensitivity, as was illustrated in Figure]^) 
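
To make the notion of a possibly uninitialized reference concrete, the following minimal sketch applies it to straight-line code only. The statement encoding (a target variable plus the variables that its right-hand side reads) and the function name are assumptions made for illustration; the analysis used in our experiments works on CFGs, and the sketch ignores branching, loops, and file states.

```python
def possibly_uninit_references(statements, variables):
    """Return (statement index, variable) pairs for possibly uninitialized reads.

    Each statement is (target, operands): `target` is assigned a value computed
    from the variables in `operands`.
    """
    possibly_uninit = set(variables)  # nothing is initialized at entry
    flagged = []
    for index, (target, operands) in enumerate(statements):
        tainted = [v for v in operands if v in possibly_uninit]
        flagged.extend((index, v) for v in tainted)
        if tainted:
            # Initialized from a possibly uninitialized value: stays suspect.
            possibly_uninit.add(target)
        else:
            possibly_uninit.discard(target)
    return flagged


# Hypothetical three-statement program.
program = [
    ("total", []),                # total = 0
    ("fee", ["rate"]),            # fee = rate * 2   (rate is never initialized)
    ("total", ["total", "fee"]),  # total = total + fee
]
print(possibly_uninit_references(program, ["total", "fee", "rate"]))
```

The second flagged reference shows the transitive part of the definition: fee is assigned, but from a value that may itself be uninitialized.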

In the interest of space we summarize the results. With Uninit, 82.5% of all 
variable references in SEQ2000 were labeled as uninitialized in the direct mode, 
whereas only 25.7% were labeled so in the lifted mode. For DTAP, the analogous 
numbers are 61.4% and 9.6%. In other programs the lifted mode performed only 
marginally better than the direct mode. 

In the case of RD, the total number of def-use edges computed by the lifted
mode was 12% below that computed by the direct mode for DTAP, 23% below
for CLIEOPP, and 15% below for PROG2. In the other programs the reduction
was marginal.

We do not have numbers for these experiments on the large program PROG4,
as the cross-product domains (mentioned above) do not yet scale to programs
of this size. On the other programs the direct analyses took anywhere from
a few hundredths of a second up to 18 seconds, while the lifted analyses took
anywhere from a few tenths of a second to 180 seconds.

We did a limited study of some programs where the lifted mode did not 
give significant benefit. Some of the causes of imprecision that we observed were 
array references, and calls to external programs, both of which we handle only 
conservatively. These confounding factors in these programs could not be offset 
by the precision improvement afforded by the input automatons. 


5.5 Discussion 

In summary, we are very encouraged by our experimental results. Except the 
two smaller programs - ACCTRAN and SEQ2000 - our benchmark programs 
are either real, or work on real formats and implement real specifications. 



File-format conformance checking and program specialization are two novel 
problems in whose context we have evaluated our tool. The tool verified four 
programs as not rejecting any well-formed files, and found genuine file-format 
related errors in several other programs. The tool was very precise in the program 
specialization context. Finally, it enabled non-trivial improvement in precision 
in the context of uninitialized variables or reaching definitions analysis on four 
of the eight programs. 


6 Related Work 

We discuss related work broadly in several categories. 

Analysis of record- and file-processing programs. There exists a body of
literature, of which the works of Godefroid et al. [19] and Saxena et al. [29] are
representative, on testing of programs whose inputs are described by
grammars or regular expressions, via concolic execution. Their approaches are more
suited for bug detection (with high precision), while our approach is aimed at
conservative verification, as well as program understanding and transformation
tasks.

Various approaches have been proposed in the literature to recover record 
types and file types from programs by program analysis [241811011411 .H] . These 
approaches complement ours, by being potentially able to infer input automatons 
from programs in situations where pre-specified file formats are not available. 

A report by Auguston [T] shows the decidability of verifying certain kinds of 
assertions in file-processing programs. 

Program specialization. Blazy et al. [5] describe an approach to specialize 
Fortran programs using constant propagation. There is a significant body of lit¬ 
erature on the technique of partial evaluation [22] , which is a sophisticated form 
of program specialization, involving loop unrolling to arbitrary depths, simpli¬ 
fication of expressions, etc. These approaches typically support only criteria on 
fixed sized program inputs. Launchbury et al. [25] extend partial evaluation to
allow criteria on data structures. Consel et al. [5| provide an interesting vari¬ 
ant of partial evaluation, wherein they propose an abstract-interpretation based 
framework to specialize functional programs with abstract values such as signs, 
types and ranges. Our approach could potentially be framed as an instantiation 
of their approach, with an input-automaton-based “lifted” lattice, and corre¬ 
sponding lifted transfer functions. 

Program slicing. Program slicing [35] is widely applicable in software
engineering tasks, usually to locate the portion of a program that is relevant to a
criterion. The constrained variants of program slicing [1614120] provide good pre¬ 
cision in general, at the cost of being potentially expensive. Existing approaches 
for constrained slicing do not specifically support constraints on the record se¬ 
quences that may appear in input files of file-processing programs. Our “lifted” 
reaching-definitions analysis, which we described in Section [5.4[ enables this sort 
of slicing. 






Typestates. There is a rich body of literature in specifying and using type 
states, with the seminal work being that of Strom et al. [32]. In the context
of analyzing file-processing programs, type-state automatons have been used 
to capture the state of a file (e.g., “open”, “closed”, “error”) um- To our 
knowledge ours is the first work in this space to use automatons to encode 
properties of the prefix of records read from a file. 

Shape analysis. Shape analysis [28] is a precise but heavy-weight technique 
for verifying shapes and other properties of in-memory data structures. While 
at a high level a data file is similar to an in-memory list, the operations used to 
traverse files and in-memory data structures are very different. To our knowledge 
shape analysis has not been used in the literature to model the contents and 
states of files as they are being read in programs. It would be an interesting 
topic of future work to explore in-depth the feasibility of such an approach. 

7 Conclusions and Future Work 

We presented in this paper a novel approach to apply any given abstract inter¬ 
pretation on a file-processing program that has an associated input file-format. 
The file-format basically enables our approach to elide certain paths in the pro¬ 
gram that are infeasible as per the file format, and hence enhance the precision 
and usefulness of the underlying analysis. We have demonstrated the value of our 
approach using experiments, especially in the context of two novel applications: 
file format conformance checking, and program specialization. 

A key item of future work is to allow richer constraints on the data in the in¬ 
put file and persistent tables; for instance, general logical constraints, constraints 
expressing sortedness, etc., would be useful in many settings to obtain enhanced 
precision and usefulness. Also, we would like to investigate our techniques on 
domains other than batch programs; e.g., to image-processing programs,
XML-processing programs, and web-based applications.

A Appendix

A.1 Precision of our approach

We discuss here, in more detail, the precision of our approach, which was
alluded to earlier. An L-solution is a function from program points in the
given program P to dataflow values from the lattice L. A (Q, L)-solution is a
function from program points to functions in the domain Q → L, where Q is the
set of file states in an input automaton.

Given a (Q, L)-solution g and an L-solution f (for the same program P), we
say that g is more precise than f iff at each program point p:

\[ \bigsqcup_{q \in Q} \{ (g(p))(q) \} \sqsubseteq f(p) \]

Note that we are actually using “more precise” as shorthand for “equally 
precise or more precise”. 
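
The definition above can be sketched as an executable check. This is a minimal illustration that assumes L is a powerset lattice, so that dataflow values are frozensets, the ordering is set inclusion, and the join is set union; the function names and sample solutions are hypothetical.

```python
from functools import reduce


def join_all(values):
    """Least upper bound in a powerset lattice: set union."""
    return reduce(frozenset.union, values, frozenset())


def lifted_is_more_precise(g, f, file_states):
    """Check that, at each program point p, the join over all file states q
    of g(p)(q) is below f(p) in the lattice ordering."""
    return all(
        join_all(g[p].get(q, frozenset()) for q in file_states) <= f[p]
        for p in f
    )


# Made-up solutions; the facts could be "variable is possibly uninitialized".
f = {"p1": frozenset({"x", "y"}), "p2": frozenset({"y"})}
g = {
    "p1": {"q1": frozenset({"x"}), "q2": frozenset()},
    "p2": {"q1": frozenset({"y"}), "q2": frozenset({"y"})},
}
print(lifted_is_more_precise(g, f, ["q1", "q2"]))  # prints True
```

At p2 the joined lifted value equals the direct value, which is why the shorthand has to cover the "equally precise" case as well.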

Let fi be the fix-point L-solution obtained for program P by directly using 
the underlying analysis U. Our first key result is as follows. 

Theorem 2. If S is any input automaton for P with set of file states Q, then
the (Q, L) fix-point solution computed by our approach using S and using the
underlying analysis U is more precise than the fix-point solution computed by U
directly.

Intuitively, the above theorem captures the fact that the path-sensitivity that
results from tracking different dataflow values (from lattice L) for different file
states causes an increase in precision. Note that the theorem above does not touch
upon soundness. In order to ensure soundness, S would additionally need to
accept all well-formed files or all files, depending on the notion of soundness
that is sought.

A different question that naturally arises is, if there are multiple candidate 
well-formed automatons that all accept the same set of well-formed files, will 
they all give equally precise results when used as part of our analysis? The 
answer, in general, turns out to be “no”. It can also be shown that if an input 
automaton accepts a smaller set of files than another automaton, then the first 
automaton need not necessarily give more precise results than the second one 
on all programs. In fact, precision is linked both to the set of files accepted as 
well as to the structure of the automatons themselves. 

In order to formalize the above intuition, we first define formally the notion
of a precision ordering on different solutions for a program P using different
input automatons. A $(Q_1, L)$-solution $g_1$ is said to be more precise than another
$(Q_2, L)$-solution $g_2$ iff at each program point $p$, for each file state
$q_1 \in Q_1$, there exists a file state $q_2 \in Q_2$ such that:

\[ (g_1(p))(q_1) \sqsubseteq (g_2(p))(q_2) \]
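
This ordering between two lifted solutions can be checked directly from its definition. The sketch below again assumes a powerset lattice with frozensets ordered by inclusion; the function name and sample data are hypothetical.

```python
def more_precise_than(g1, g2, states1, states2):
    """For every program point and every file state q1 of the first automaton,
    require some file state q2 of the second automaton whose value is above
    the value at q1."""
    return all(
        any(g1[p][q1] <= g2[p][q2] for q2 in states2)
        for p in g1
        for q1 in states1
    )


# Made-up solutions: the first automaton splits the single state "q" of the
# second into two states "a" and "b".
g_fine = {"p": {"a": frozenset({"x"}), "b": frozenset({"y"})}}
g_coarse = {"p": {"q": frozenset({"x", "y"})}}

print(more_precise_than(g_fine, g_coarse, ["a", "b"], ["q"]))
print(more_precise_than(g_coarse, g_fine, ["q"], ["a", "b"]))
```

In this example the check holds in the first direction only, since neither "a" nor "b" carries a value that covers both facts.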

We then define a notion of refinement among input automatons for the same
program P. We say that automaton $S_2 = (Q_2, \Sigma_2, \Delta_2, q_{s2}, Q_{e2})$
is a refinement of automaton $S_1 = (Q_1, \Sigma_1, \Delta_1, q_{s1}, Q_{e1})$
iff there exists a mapping function $m : Q_2 \to Q_1$ such that:

- $m(q_{s2}) = q_{s1}$, and

- For each transition $p_2 \to q_2$ in $\Delta_2$ labeled with some symbol
$s_2 \in \Sigma_2$, there exist one or more transitions from $m(p_2)$ to $m(q_2)$
in $\Delta_1$. Furthermore, if $s_1$ is the label on any of these transitions,
then either (1) $s_2$ and $s_1$ are both eof, or
(2) $s_2$ and $s_1$ are both types, and $s_2$'s constraint implies $s_1$'s constraint.
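
The two conditions above can be turned into a check for a candidate mapping m. The following is a minimal sketch under several assumptions: automatons are dictionaries with a start state and a list of (source, label, target) triples, implication between type constraints is supplied by the caller as a predicate, and the label condition is required of every corresponding transition in the coarser automaton (the stricter reading of "any"). All names and the sample automatons are hypothetical.

```python
EOF = "eof"


def is_refinement(s2, s1, m, implies):
    """Check whether automaton s2 refines s1 under the state mapping m.

    `implies(t2, t1)` must return True when the constraint of type t2
    implies the constraint of type t1.
    """
    if m[s2["start"]] != s1["start"]:
        return False
    for p2, label2, q2 in s2["transitions"]:
        labels1 = [
            label1
            for p1, label1, q1 in s1["transitions"]
            if p1 == m[p2] and q1 == m[q2]
        ]
        if not labels1:
            return False  # no corresponding transition in s1
        for label1 in labels1:
            both_eof = label1 == EOF and label2 == EOF
            both_types = label1 != EOF and label2 != EOF
            if not (both_eof or (both_types and implies(label2, label1))):
                return False
    return True


# Hypothetical automatons: s1 accepts any sequence of Record; s2 alternates
# Debit and Credit records, both of which are special cases of Record.
s1 = {"start": "r", "transitions": [("r", "Record", "r"), ("r", EOF, "end")]}
s2 = {
    "start": "a",
    "transitions": [("a", "Debit", "b"), ("b", "Credit", "a"), ("a", EOF, "done")],
}
m = {"a": "r", "b": "r", "done": "end"}
subtypes = {("Debit", "Record"), ("Credit", "Record")}
implies = lambda t2, t1: t2 == t1 or (t2, t1) in subtypes

print(is_refinement(s2, s1, m, implies))  # prints True
```

The sketch only verifies a given mapping; searching for a suitable m is a separate problem that it does not address.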

If an input automaton $S_2$ is a refinement of an input automaton $S_1$, then the
following two properties can be shown to hold: (1) each accepting state of $S_2$ is
mapped by $m$ to an accepting state of $S_1$, and (2) the set of files accepted by $S_2$
is a subset of the set of files accepted by $S_1$. Now, our main result on precision
ordering of input automatons is as follows.

Theorem 3. For any program P and for any given underlying analysis U, if an
input automaton $S_2$ for P is a refinement of an input automaton $S_1$ for P, then
the fix-point solution computed by our approach using $S_2$ and U is more precise
than the fix-point solution computed by our approach using $S_1$ and U.

An important takeaway from the above theorem is: when there is a choice
between two input automatons that accept the same set of files (e.g., two different
well-formed automatons accepting the same set of well-formed files), if one of
them is a refinement of the other then the refined automaton will give results
that are at least as precise as those of the one of which it is a refinement.



A.2 Checking over-acceptance errors 


We discuss here a procedure to extend a given well-formed automaton
$S = (Q, \Sigma = T \cup \{eof\}, \Delta, q_s, Q_e)$ into a full automaton. We first
create a new type named “NA” (none of the above), and associate with it a constraint
that lets it cover all records that are not covered by any of the types in the original
set of types T. We also add a new file state to the well-formed automaton, which we
denote as $q_y$ in this discussion. Let $T' = T \cup \{NA\}$, and $Q' = Q \cup \{q_y\}$. For
every state q in Q', and for every type t in T', if there is no transition labeled t
out of q we add a transition from q to $q_y$ labeled t. Finally, we add one more new
file state $q_x$ to the automaton, make it a final state, and add eof transitions from
all non-final states to this state. The intuition behind this construction is that
$q_x$ accepts all ill-formed files, while $q_y$ accepts all record sequences that are not
prefixes of well-formed files.
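
The construction above can be sketched in code. This is a minimal illustration, assuming transitions are (source, label, target) triples and that type constraints are handled elsewhere. One deviation is an assumption of the sketch rather than part of the stated procedure: the eof transition to the new final state is added only for non-final states that have no outgoing eof transition already, which keeps the result deterministic. All names and the sample automaton are hypothetical.

```python
EOF = "eof"
NA = "NA"  # the "none of the above" type


def make_full_automaton(states, types, transitions, start, finals, q_y="q_y", q_x="q_x"):
    """Extend a well-formed automaton into a full automaton."""
    full_types = set(types) | {NA}
    full_states = set(states) | {q_y}
    full_transitions = set(transitions)
    has_out = {(source, label) for source, label, _target in transitions}

    # Every missing type transition leads to the sink state q_y.
    for state in full_states:
        for record_type in full_types:
            if (state, record_type) not in has_out:
                full_transitions.add((state, record_type, q_y))

    # Assumption of this sketch: skip states that already have an eof transition.
    for state in full_states - set(finals):
        if (state, EOF) not in has_out:
            full_transitions.add((state, EOF, q_x))

    full_states.add(q_x)
    return full_states, full_types, full_transitions, start, set(finals) | {q_x}


# Hypothetical well-formed automaton: one Header followed by any number of Items.
wf_transitions = [("q0", "Header", "q1"), ("q1", "Item", "q1"), ("q1", EOF, "qe")]
full = make_full_automaton({"q0", "q1", "qe"}, {"Header", "Item"}, wf_transitions, "q0", {"qe"})
full_states, full_types, full_transitions, full_start, full_finals = full
print(len(full_states), len(full_transitions))  # prints 5 15
```

In the example, a file that starts with an Item record moves to q_y and, at end of file, to q_x, so it is accepted by the full automaton but not by the well-formed one.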



